Conceptually, there is some elegance in the approach, since everything is just either a Key, a Node/KeyList or a Page...
There were five KeyKOS ideas that ultimately did not survive into Coyotos:
- Meters were replaced by scheduling capabilities (meters can't express real-time requirements).
- Everything is a node or a page (too restrictive, high complexity).
- The memory hierarchy using nodes as segments (restrictive, space inefficient, and high complexity).
- Kernel calls as the atomic units of operation (does not admit a serious hardware-concurrent implementation).
- The rule that all invocations are synchronous and blocking.
The parenthesized parts are my opinion. Let me take these in turn.
Meters
Meters were abandoned for two reasons. The first is that they aren't a good model for hard or soft real-time systems; they simply don't express what you want. We initially replaced them with simple priorities in EROS to get going. We intended to implement a different scheduling model, but EROS was replaced by Coyotos before the replacement was started.
The second is that we had an explicit goal to change the time bound requirements for the kernel. In KeyKOS, worst-case operations are supposed to have a time bound of O(log n), where n is the number of objects touched and n always has a small constant bound. The notable exception to this is the meter tree, where a meter flush that happens at the top of the meter tree can (in theory) touch every meter below it. While the height of the meter tree is bounded, the width of the meter tree is not bounded. In consequence, there is no constant bound on the number of meters that may be touched. In Coyotos, the goal was to have an O(1) time bound for all kernel operations. More precisely: O(n) for constant-bounded n.
One of our goals at Johns Hopkins was to start proving properties about these systems, and one of the properties we wanted to prove was the constant time bound property.
Given this, meters had to go.
Nodes and Pages
In KeyKOS and EROS, everything is a node or a page. Segments and domains/processes are "skins" on a node. The skin you are looking through and the operations you can perform are determined by the capability type.
At the disk level, KeyKOS disk "ranges" are either node ranges or page ranges, and the type cannot be changed without a dump and restore (and then only with difficulty). The checkpoint area isn't quite as strict; there are data pages and page-sized "node pots." This made checkpoints complex. One of the big differences at the storage layer between KeyKOS and EROS was that the EROS checkpoint area is a circular log. The evolution is discussed here.
An important difference is that EROS ranges can be allocated dynamically. Since different systems can require very different proportions, the ability to allocate ranges as you need them is very helpful. To facilitate this, EROS uses a lookup table to map from object id to disk range, while KeyKOS uses an encoded disk address as the object ID. This is one of the small differences in the on-disk capability formats. This level of indirection permits on-the-fly storage rearrangement by abusing the range replication logic. It also simplifies construction of the initial system image slightly.
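To make the indirection concrete, here is a minimal sketch of mapping an object ID to a disk range through a table rather than encoding the disk address in the OID. The names and layout are hypothetical, not the actual EROS data structures:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef uint64_t oid_t;

/* One allocatable disk range.  In EROS, ranges can be added while the
 * system is running; in KeyKOS the OID itself encoded the disk address,
 * so this indirection did not exist. */
typedef struct ObRange {
  oid_t    startOid;     /* first object ID covered by this range */
  oid_t    endOid;       /* one past the last covered object ID */
  uint64_t startSector;  /* where the range begins on disk */
  bool     isNodeRange;  /* node range vs. page range */
} ObRange;

static ObRange rangeTable[64];   /* populated at mount time */
static size_t  nRanges = 0;

/* Map an object ID to its containing range, or NULL if unallocated. */
static ObRange *lookup_range(oid_t oid) {
  for (size_t i = 0; i < nRanges; i++) {
    if (oid >= rangeTable[i].startOid && oid < rangeTable[i].endOid)
      return &rangeTable[i];
  }
  return NULL;
}
```

Because capabilities store only the OID, a range can be migrated or replicated and the table updated without rewriting any stored capability.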
The "only nodes and pages" design created a bunch of complex challenges for process state. There is a published description of how this works
here. It's an early paper, and definitely not one of my better efforts, but it's a start. As I mentioned earlier, the prepare/deprepare logic was the origin of the only bug I ever found in KeyKOS. The complexity of making everything a node in KeyKOS touched
everything.
As Norm described it to me, one reason for this restriction was that proliferating object types would mean proliferating the number of disk range types, which would increase the likelihood that you'd end up with an unfortunate (and constraining) division of space on the disk level. The EROS version does not share this problem, and in Coyotos we eventually made process objects and endpoint objects first-class. This allowed us to remove node objects entirely. That removed a lot of ugly and complicated code, which was helpful, because it simplified the implementation of a multiprocessor kernel by a lot.
Since we no longer needed it for processes, Coyotos is able to replace the Node object with the GPT object. GPTs require fewer objects to describe address spaces that have multiple regions. They also allow us to cleanly support multiple page sizes (which nodes cannot do in KeyKOS or EROS).
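For flavor, here is a rough sketch of what a GPT-like translation object looks like. The field names and fan-out are illustrative rather than the actual Coyotos declarations; the essential point is the per-GPT span field, which is what lets a small number of GPTs describe multi-region address spaces and mixed page sizes:

```c
#include <stdint.h>

#define GPT_SLOTS 16            /* illustrative fan-out */

struct capability;              /* names a page or another GPT */

typedef struct GPT {
  uint8_t  l2v;                 /* log2(bytes spanned by each slot) */
  struct capability *slot[GPT_SLOTS];
} GPT;

/* Select the slot that translates virtual address 'va' at this level.
 * Because each GPT carries its own l2v, adjacent levels can cover very
 * different spans, which is how multiple page sizes fall out naturally. */
static inline unsigned gpt_slot_for(const GPT *g, uint64_t va) {
  return (unsigned)((va >> g->l2v) & (GPT_SLOTS - 1));
}
```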
If you decide to examine Coyotos, your first priority should be to resurrect CapIDL, because you will need that to generate the documentation tree. The "base" portion of the documentation tree documents all of the kernel interfaces in a standard, web-browsable way.
Memory Hierarchy
This was mostly the GPT change, but also a change to the invocation mechanism that allows capabilities to be used from the memory tree. In Coyotos, there is a single, unified address space. All leaf objects in the memory tree are pages. Pages are "tagged" at allocation time as either capability pages or data pages. This allowed us to use the same memory walk logic for both address spaces, and it also allows us to create a single address space that contains "capability regions" and "data regions."
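A minimal sketch of the allocation-time tagging idea, under my reading of the design (these are not the real Coyotos definitions): the page type is fixed when the page is allocated, the same walk logic serves both kinds of region, and capability stores are honored only on capability pages, which is what keeps capability bits from being forged by ordinary data writes.

```c
#include <stdbool.h>

typedef enum { pt_DataPage, pt_CapPage } PageType;   /* fixed at allocation */

typedef struct Page {
  PageType type;
  void    *frame;      /* backing physical frame */
} Page;

/* The memory walk is identical for both region kinds; only the final
 * load/store check differs. */
static inline bool page_accepts_cap_store(const Page *pg) {
  return pg->type == pt_CapPage;
}
```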
System Calls as Atomic Units of Operations
This was the hardest decision, and it had huge consequences for the system specification, the kernel design, and the kernel implementation.
The Coyotos design was a joint effort between myself and Jonathan Adams (JWA). One of the very valuable things that JWA brought to the effort was a detailed understanding of how multithreading was done in Solaris, and the various things Solaris had gotten wrong and had to fix over the years. Norm's story for multithreading in KeyKOS basically required a "big kernel lock." This was also the first approach in Linux, and it severely hampered performance.
I believe that Norm liked the "big kernel lock" idea for two reasons:
- It is simple, and
- It preserves the story that the current state of the system can always be described inductively as a sequence of transforming operations (capability invocations) that have been applied to an initial state. You may not know the sequence, but knowing that it exists enables a bunch of verification proofs about the correctness of the system state.
So far as I know, KeyKOS and EROS are the only kernels for which this claim can be made. It's one of the reasons that the verification proof [Shapiro and Weber 2000] and the later mechanized verification proof (Scott Doerrie's dissertation) are possible. According to Jochen, L4 did not have this property. I do not know if seL4 does, or if it instead adopts something similar to the approach adopted in Coyotos (described below). Gerwin Klein's TOCS paper probably sheds light on this. Given the hardware they were working on, I would expect seL4 to consider the same multithreading concerns that we did. Note that Gerwin's team didn't fully understand how our initial system image was constructed; a correctness verification there was certainly possible.
Verification is one issue, but performance is another. The system calls in KeyKOS, EROS, and Coyotos have much shorter paths, but the degree of multithreading is much higher, and the cache coherency traffic alone for the big kernel lock approach places a severe limit on overall system performance. Given the percentage of time spent in the kernel in these systems, you really don't want to make the kernel be effectively single threaded. It would mean that operations on completely unrelated objects (each of which is probably local to the acting hardware cache) have to be done sequentially. Even on Linux, which spends much less of its time (proportionally) in the kernel (and therefore should have much less contention), the impact was severe.
The alternative is to relax the requirement that system calls are atomic units of operation. All serious multithreading kernels do this. JWA played a critical role in architecting and implementing what follows, and Eric Northup provided a bunch of tutoring on hardware concurrency to a certain doddering former graduate advisor. :-)
In Coyotos, the concurrency contract is:
- Every kernel operation consists of a setup phase (in which all necessary objects are locked) followed by an atomic "action" phase (but see below re data read/write).
- Durable preconditions established in the setup phase may be invalidated by a competing hardware thread before the action phase begins. This is checked before the action phase, and a thread that depends on an invalidated precondition will abandon (and restart) its operation from scratch.
- Once the action phase is begun, you're committed (so to speak). The action phase must either complete or it must induce a hard system reboot due to an unrecoverable error (in our case, most likely an ECC memory fault). Action phases are O(1). Once a thread enters its action phase, a competing, higher-priority thread will wait for the action phase to complete rather than breaking a lock held by the acting thread.
- It is explicitly permitted for memory walks to occur where process A traverses a GPT during its setup phase, process B modifies that GPT, and process A then modifies a target object lower in the memory tree being walked (that is: after the path that it walked has been modified behind it). The GPT slot read is atomic, but the GPT path traversal is not. The slot-level read and write operations are atomic, but these do not create durable preconditions. Getting the interaction correct between the GPT tree and the hardware mapping tree (for some architectures) and the soft-TLB (for others) was very entertaining.
  Note that most traversals are done in the hardware page tree, which is a partial cache of the GPT state, so this particular form of weak consistency is (in practice) a requirement.
- If simultaneous readers and writers exist for data page[s], it is guaranteed that some byte string will be returned to the reader, but there is no guarantee about which of the concurrently written bytes will be seen by the reader. This is partly because you can't make the hardware take locks during a load instruction, and partly because many hardware implementations do not implement hardware-enforced consistency across caches.
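As a rough illustration (hypothetical function names, not the actual Coyotos entry path), the contract above boils down to a setup/commit pattern along these lines:

```c
#include <stdbool.h>

typedef enum { SETUP_OK, SETUP_CONFLICT } SetupResult;

extern SetupResult lock_needed_objects(void);        /* setup: lock and validate */
extern bool        preconditions_still_hold(void);   /* re-check durable preconditions */
extern void        do_action_phase(void);            /* O(1); must run to completion */
extern void        abandon_and_restart(void);        /* resets the kernel stack, re-enters */

void invoke_capability(void) {
  /* Setup phase: acquire locks and establish preconditions.  Losing a
   * lock conflict to a higher-priority thread abandons the whole
   * operation; nothing live is on the stack yet. */
  if (lock_needed_objects() != SETUP_OK)
    abandon_and_restart();

  /* A competing thread may have broken a durable precondition while we
   * were still setting up; if so, start over from scratch. */
  if (!preconditions_still_hold())
    abandon_and_restart();

  /* Action phase: committed.  Competing threads wait for this to finish
   * rather than breaking locks held by the acting thread. */
  do_action_phase();
}
```

The important property is that nothing live sits on the kernel stack at either abandonment point, which is what makes the stack-reset restart described further below so cheap.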
So each Coyotos system call consists of a well-defined series of [mostly] atomic operations, but the state of the system cannot necessarily be reconstructed by re-applying some sequence of system calls.
From a verification perspective, data read/write operations have no security semantics, so that overlap doesn't matter. It also turns out that the memory walk overlap (which is initially discomforting) is safe from both a verification and a concurrency perspective.
One of the clever things we did in Coyotos (shap said, dislocating his shoulder while patting himself on the back) was to exploit the mostly-atomic nature of Coyotos to implement locking differently. Coyotos consistency is implemented by a mix of locks (for both durable and non-durable preconditions) and atomic operations (for non-durable preconditions). Every process has a current transaction ID that is incremented on each call, and this ID is embedded in the lock word. If the lock's transaction ID and the owning process's transaction ID do not match, the lock is not held. This means that locks are "gang released" simply by incrementing the per-process transaction ID. I've never seen any other system that works this way. See the comments in kerninc/mutex.h.
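Here is a sketch of the trick with illustrative names. The real code embeds the transaction ID in the lock word itself (see kerninc/mutex.h) and uses atomic operations to grab a contended lock safely; this sketch splits the fields apart for readability and leaves the contention handling out:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct Process {
  uint32_t curTransaction;      /* bumped on every kernel entry */
} Process;

typedef struct mutex {
  Process *owner;               /* last process to grab the lock */
  uint32_t ownerTransaction;    /* owner's transaction ID at grab time */
} mutex;

/* The lock is held only if the recorded transaction ID still matches the
 * owner's current one. */
static inline bool mutex_is_held(const mutex *m) {
  return m->owner && m->ownerTransaction == m->owner->curTransaction;
}

static inline void mutex_grab(mutex *m, Process *p) {
  m->owner = p;
  m->ownerTransaction = p->curTransaction;
}

/* "Gang release": every lock this process holds goes stale at once.  No
 * per-lock release, and no walking back up the call stack to find them. */
static inline void mutex_release_all(Process *p) {
  p->curTransaction++;
}
```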
A related trick is used for transient mappings of things like page tables, in which mandatory TLB invalidations are exploited to lazily invalidate transient mappings. Because the cost of TLB invalidates is very high, this turns out to have a significant performance benefit. See hal/transmap.h.
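A sketch of the idea as I read it, with illustrative names and a made-up virtual address window (the real interface is in hal/transmap.h): releasing a transient mapping just marks the slot stale, and a targeted TLB invalidate is paid only if that slot is reused before some mandatory full flush has already cleaned it up.

```c
#include <stdint.h>

#define TRANSMAP_SLOTS 64
#define TRANSMAP_BASE  0xffff800000000000ull   /* hypothetical VA window */
#define PAGE_SIZE      4096

static uint64_t slots_in_use;   /* bitmap: slot currently mapped */
static uint64_t slots_stale;    /* bitmap: released, TLB entry may still be live */

extern void write_transmap_pte(unsigned slot, uint64_t pa);  /* hypothetical */
extern void invalidate_slot_tlb(unsigned slot);              /* hypothetical */

void *transmap_map(uint64_t pa) {
  for (unsigned s = 0; s < TRANSMAP_SLOTS; s++) {
    if (slots_in_use & (1ull << s))
      continue;
    if (slots_stale & (1ull << s)) {
      invalidate_slot_tlb(s);            /* pay for the flush only on reuse */
      slots_stale &= ~(1ull << s);
    }
    slots_in_use |= (1ull << s);
    write_transmap_pte(s, pa);
    return (void *)(uintptr_t)(TRANSMAP_BASE + (uint64_t)s * PAGE_SIZE);
  }
  return (void *)0;                      /* no free slot; real code recycles one */
}

void transmap_release(unsigned slot) {
  slots_in_use &= ~(1ull << slot);
  slots_stale  |=  (1ull << slot);       /* defer the TLB invalidate */
}

/* Called on paths that flush the whole TLB anyway (e.g. address space
 * switch): every deferred invalidation becomes free. */
void transmap_advise_full_flush(void) {
  slots_stale = 0;
}
```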
So far as I recall, the only explicit lock release in the Coyotos kernel occurs in the GPT walk path (where locks are non-durable). The big reason for making these locks non-durable is that we don't want to impose serialization on kernel operations when multiple threads share a common address space or address sub-space. It also matches the fact that hardware-implemented walks of page tables that cache the GPT state are proceeding concurrently, and reduces the number of required IPI operations significantly.
Anyway, if you are scratching your head looking for the lock releases and/or the page unmaps, that's why they aren't there.
This locking implementation is at least an order of magnitude faster than the usual approach where locks need to be explicitly released. In conventional kernels, abandoning or re-trying a system call requires returning up the stack through all of the currently outstanding procedure calls. This is necessary for two reasons:
- Most kernels do not cleanly distinguish between setup and operation phases. In such kernels, there are intermediate state changes that need to be undone when a system call fails, errors, needs to be restarted, or succeeds.
- In most kernels, the knowledge of lock locations is recorded in stack-local variables. The containing stack frames need to be re-visited in order to release locks.
Note that the stack is generally "cold" in the cache by this point, because conventional kernel stacks can get pretty big.
In Coyotos, a system call may be abandoned for one of three reasons:
- A lock conflict during the setup phase in which the requesting thread has lower priority and yields. At this point, no "live" state exists on the stack, so there is no need for conventional procedure call returns to clean up.
- A "broken durable precondition" is found, which means that a higher-priority thread stole your lock. This can only occur prior to the action phase, so once again, there is no "live" state on the stack.
- The action phase has been completed, in which case there is no "live" state on the stack.
In all three cases, the kernel stack is abandoned in place by assembly-level code that resets the stack pointer. In the first two cases, the kernel jumps back to the instruction that follows the inbound argument save. The difference in the third case is that the currently active system call advances from the "send" phase to the "receive" phase. The implementation isn't even done with a longjmp(). It's an inline-compiled jump instruction to an assembly-level code sequence that bashes the kernel stack pointer directly.
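For concreteness, a sketch of what that abandonment could look like on x86-64, with hypothetical symbol names (the real code is per-architecture assembly in the Coyotos tree):

```c
#include <stdnoreturn.h>

extern char kernel_stack_top[];        /* per-CPU kernel stack top */
extern void resume_after_save(void);   /* label just past the inbound argument save */

/* Abandon the current kernel stack in place: reset the stack pointer and
 * jump back to the restart point.  No unwinding, no longjmp(), and no
 * destructors; this is only safe because nothing live is ever left on
 * the stack at an abandonment point. */
noreturn void abandon_and_restart(void) {
  __asm__ volatile(
      "movq %0, %%rsp\n\t"
      "jmp  resume_after_save"
      : : "r"(kernel_stack_top) : "memory");
  __builtin_unreachable();
}
```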
This is one of the reasons that Coyotos was implemented in C rather than C++. In EROS, a lot of effort was needed to ensure that no object requiring a destructor ever ended up on the stack. The ultimate death of C++ for us happened when it was no longer possible to turn off exception handling - the usual implementations of exceptions have a large cost in stack size, code size, and cache temperature. Given what I've just said about the abandonment strategy, it should be clear that (by design) we didn't need an exception handling mechanism.
And no, we don't kick the process all the way back to user mode in this case. That user->kernel transition is damned expensive! That particular optimization was something we did very early in EROS. Linux later adopted something similar in selected cases. I have wondered whether that idea may have indirectly made its way from our work to theirs. Some very sharp people have worked on that part of the Linux kernel.