SwingSet nanokernel

Christopher Lemmer Webber

Dec 15, 2020, 11:49:18 AM
to cap-...@googlegroups.com
Previously I assumed Agoric's "SwingSet" system was called that because
I think the previous version was called "Playground"... I assumed then
that this was an indicator of moving from alpha to beta status. I
didn't realize until a recent talk by MarkM (I forget which, one of the
talking-to-Cosmos-folks ones) that the name had an illustrative
purpose: the coordination between its moving parts (a set of swinging
pieces). Mark mentioned that one advantage of the SwingSet was that it
allows determinism between a set of vats on a machine by having messages
pass through it. I didn't know about and hadn't considered this
previously. It seems like a powerful concept.

I think I could adopt a similar idea in Goblins.

Are there other interesting parts of the SwingSet architecture? Notes,
a presentation, or docs I should look at to see if there are other
applicable ideas? Especially if influential on CapTP interop between
Agoric <-> Goblins (though I suspect it wouldn't be... so okay maybe I'm
just interested in the new ideas).

(Sadly I know I missed a Friam presentation by Warner about
SwingSet... wish I hadn't. :( )

- Chris

Christopher Lemmer Webber

Dec 15, 2020, 11:49:27 AM
to cap-...@googlegroups.com
Having asked this, I now see that the SwingSet package actually ships
with quite a few markdown files in the docs/ subdir that I should read
before asking for more information. ;)

Chris Hibbert

Dec 15, 2020, 11:11:05 PM
to cap-...@googlegroups.com, Christopher Lemmer Webber
On 12/15/20 8:29 AM, Christopher Lemmer Webber wrote:
> Previously I assumed Agoric's "SwingSet" system was called that because
> I think the previous version was called "Playground"... I assumed then
> that this was an indicator of moving from alpha to beta status. I
> didn't realize until a recent talk by MarkM (I forget which, one of the
> talking-to-Cosmos-folks ones) that the name had an illustrative
> purpose: the coordination between its moving parts (a set of swinging
> pieces). Mark mentioned that one advantage of the SwingSet was that it
> allows determinism between a set of vats on a machine by having messages
> pass through it. I didn't know about and hadn't considered this
> previously. It seems like a powerful concept.

On the etymology question, we originally thought we were going to use a
"playground" theme for naming pieces of our system, so swingset, teeter
totter, slide, carousel. But SwingSet turned out to be the first and
last in that sequence (AFAICR). This is the first I've heard of the
connection to coordinating moving parts.

> Having asked this, I now see that the SwingSet package actually
> ships with quite a few markdown files in the docs/ subdir that I
> should read before asking for more information.
That's probably worthwhile, but if there's something you want to
understand better that's not written up there feel free to ask. You're
more likely to get a verbal or email response than a formal doc, but
we're generally happy to talk about the approach.

Chris
--
I think that, for babies, every day is first love in Paris. Every
wobbly step is skydiving, every game of hide and seek is Einstein
in 1905.--Alison Gopnik (http://edge.org/q2005/q05_9.html#gopnik)

Chris Hibbert
hib...@mydruthers.com
Blog: http://www.pancrit.org
Twitter: C_Hibbert_reads
http://mydruthers.com

Christopher Lemmer Webber

Dec 16, 2020, 3:30:04 PM
to Chris Hibbert, cap-...@googlegroups.com
Chris Hibbert writes:

> On 12/15/20 8:29 AM, Christopher Lemmer Webber wrote:
>> Previously I assumed Agoric's "SwingSet" system was called that because
>> I think the previous version was called "Playground"... I assumed then
>> that this was an indicator of moving from alpha to beta status. I
>> didn't realize until a recent talk by MarkM (I forget which, one of the
>> talking-to-Cosmos-folks ones) that the name had an illustrative
>> purpose: the coordination between its moving parts (a set of swinging
>> pieces). Mark mentioned that one advantage of the SwingSet was that it
>> allows determinism between a set of vats on a machine by having messages
>> pass through it. I didn't know about and hadn't considered this
>> previously. It seems like a powerful concept.
>
> On the etymology question, we originally thought we were going to use
> a "playground" theme for naming pieces of our system, so swingset,
> teeter totter, slide, carousel. But SwingSet turned out to be the
> first and last in that sequence (AFAICR).

Ah :)

> This is the first I've heard of the connection to coordinating moving
> parts.

Interesting. I may have inferred based on the way MarkM described
something as "A SwingSet is a collection of..." and it got me to think,
okay, so that's the set, now how are they swinging?

Feel free to steal the moving parts connection as a back-explanation
even if unintended.

>> Having asked this, I now see that the SwingSet package actually
>> ships with quite a few markdown files in the docs/ subdir that I
>> should read before asking for more information.
>
> That's probably worthwhile, but if there's something you want to
> understand better that's not written up there feel free to ask. You're
> more likely to get a verbal or email response than a formal doc, but
> we're generally happy to talk about the approach.

Cool. Well, as it turns out, I did write down several questions:

1) How does SwingSet currently handle a failure of a vat turn?
Are state changes, such as the decrement of an integer value stored
under some variable, automatically reset?
I would think the right answer would be "yes", but I haven't been
able to imagine how it would be done with the present architecture I
understand Agoric to be using without either:

a) doing a full snapshot each turn, from the VM level

b) doing a full snapshot each turn, jhu-paper style

c) starting with initial conditions and replaying all messages to
return to current state, which would mean that failures would
result in quadratic replay behavior, so I cannot imagine this is
the case (or if it is, it must be a temporary thing)

2) Is there "hidden dynamism or global state" which allows a vat to work
right? There is in Spritely Goblins... a hidden "current-syscaller"
parameter/fluid, dynamically bound for a turn: to allow messages
which are not dispatched until end-of-turn to be queued when doing
<-, to build up the current transaction, etc etc. I hate this but as
far as I can tell, I would either have to do this or force a very
explicit monad everywhere. Thus I joke that Goblins can be thought
of as having an "implicit monad".

I saw in the docs that SwingSet has some notion of a "userspace";
I wondered what you were doing, and for that matter, how wavy-dot
was able to do a similar thing with queueing messages that are only
released after vat-turn.

3) For that matter, is there a description of what the swingset
nanokernel is, how it is booted up and what the available system
calls are? I'm curious to compare.

4) Is it true that "async/await" are currently discouraged but not yet
banned? I was surprised to see them in the ERTP implementation given
all the warnings MarkM has levied against re-entrancy attacks with
coroutines? (Okay I guess this is a more general Agoric question,
less SwingSet.)

That's mostly it. I keep wondering about these, especially 1 & 2.

- Chris

Chip Morningstar

Dec 17, 2020, 2:52:01 PM
to cap-...@googlegroups.com
On Dec 16, 2020, at 12:30 PM, Christopher Lemmer Webber <cwe...@dustycloud.org> wrote:

> 1) How does SwingSet currently handle a failure of a vat turn?
>   Are state changes, such as the decrement of an integer value stored
>   under some variable, automatically reset?
>   I would think the right answer would be "yes", but I haven't been
>   able to imagine how it would be done with the present architecture I
>   understand Agoric to be using without either:
>
>   a) doing a full snapshot each turn, from the VM level
>
>   b) doing a full snapshot each turn, jhu-paper style
>
>   c) starting with initial conditions and replaying all messages to
>      return to current state, which would mean that failures would
>      result in quadratic replay behavior, so I cannot imagine this is
>      the case (or if it is, it must be a temporary thing)

It’s a little unclear to me from the context what you mean by “failure of a vat turn”.

If you mean “what do you do if a vat fails?”, the answer is: we kill the vat.  Since vats are deterministic, if it fails once it will always fail and so there’s no point to trying to recover; the vat is considered to have broken the rules and is terminated as if the turn had never happened.

If you mean “what do you do if the hosting environment fails somehow?” (e.g., the process dies due to a hardware glitch, etc.), the answer is (c).  As you speculate, this is intended as a temporary thing, but perhaps not as temporary as you might hope.  The longer term plan is (a), but given that our principal target environment is blockchain, we need to have the snapshot be part of the consensus state, which requires much more careful specification and implementation than if all you needed to do was something more like a heap dump.  The good news is that to a very rough first approximation, such failures don’t happen, so the primary use case is spinning up a new validator, which typically can afford to spend some time.  Also, because of the way we log the kernel state and transaction history, vats can be replayed in parallel (though this optimization wrinkle is not currently implemented because we haven’t needed it yet).
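
To make option (c) concrete, here is a minimal sketch of replay-based recovery. Every name in it (`makeCounterVat`, `transcript`, `deliverAndLog`, `replay`) is hypothetical and chosen for illustration; none of it is SwingSet's actual API. The only assumption is the one stated above: a vat is deterministic, so re-delivering the same messages in the same order rebuilds the same in-memory state.

```javascript
// Hypothetical toy vat: its state lives only in a closed-over variable,
// so it cannot simply be read out and saved.
function makeCounterVat() {
  let count = 0; // closed-over state, invisible from outside
  return {
    deliver(message) {
      if (message.method === 'increment') count += message.amount;
      if (message.method === 'decrement') count -= message.amount;
      return count;
    },
  };
}

// Log of every delivery made to the vat (assumed to survive a host crash).
const transcript = [];

function deliverAndLog(vat, message) {
  transcript.push(message);
  return vat.deliver(message);
}

// Recovery: build a fresh vat and feed it the logged deliveries in order.
function replay(makeVat, log) {
  const vat = makeVat();
  let lastResult;
  for (const message of log) lastResult = vat.deliver(message);
  return { vat, lastResult };
}

const original = makeCounterVat();
deliverAndLog(original, { method: 'increment', amount: 5 });
const before = deliverAndLog(original, { method: 'decrement', amount: 2 });

// Simulate a host failure: the live vat is gone, only the log remains.
const recovered = replay(makeCounterVat, transcript);
console.log(before, recovered.lastResult); // both values are 3
```

The cost of `replay` grows with the length of the transcript, which is why a snapshot of the vat itself is the more attractive long-term option.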

Also, a bit of terminology that may be clarifying, especially if you are looking at the code: we use the word “turn” to refer to one pass through the underlying JavaScript engine’s event loop (what a lot of engine implementors call a “microtask”).  We use the word “crank” to refer to one pass through our higher level event loop, i.e., all of the activity that happens in a vat as a consequence of a message delivery into that vat, which typically encompasses multiple turns; basically, we let the vat run to quiescence, at which point it no longer has agency and control returns to the kernel.  We use the word “block” to refer to a series of cranks that are treated as a unit for purposes of resource management and consensus.

The kernel state is persistently committed at crank boundaries (though a later, more advanced implementation may choose to actually perform the commit at block boundaries to amortize the overhead, with replay used to fill in the gaps if there’s a failure, kind of the way sophisticated databases use a mixture of roll-forward and roll-back strategies) and contains everything needed to reconstruct the swingset state as of that moment in time; however, we can’t capture idiosyncratic memory state inside the vats (consider, for example, closed over variables) without the complicity of the underlying JavaScript engine, hence the persistence and failure story above.
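
The closed-over-variables problem can be seen in a few lines of plain JavaScript. This is only an illustrative sketch with a hypothetical `makeCounter`; it uses `JSON.stringify` as a stand-in for any checkpointing that works from outside the engine.

```javascript
function makeCounter() {
  let count = 0; // closed-over variable: reachable only through `increment`
  return { increment: () => (count += 1) };
}

const counter = makeCounter();
counter.increment();
counter.increment();

// A naive "checkpoint" of the object sees no data at all: JSON.stringify
// omits function-valued properties, and `count` is not a property.
const checkpoint = JSON.stringify(counter);
console.log(checkpoint); // {}

// Restoring from that checkpoint loses both the behavior and the count.
const restored = JSON.parse(checkpoint);
console.log(typeof restored.increment); // undefined
console.log(counter.increment()); // 3 (only the live closure remembers)
```

Nothing in the language lets outside code read `count`, so capturing it needs help from the engine (a heap snapshot) or a replay of the deliveries that produced it.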

> 2) Is there "hidden dynamism or global state" which allows a vat to work
>   right?  There is in Spritely Goblins... a hidden "current-syscaller"
>   parameter/fluid, dynamically bound for a turn: to allow messages
>   which are not dispatched until end-of-turn to be queued when doing
>   <-, to build up the current transaction, etc etc.  I hate this but as
>   far as I can tell, I would either have to do this or force a very
>   explicit monad everywhere.  Thus I joke that Goblins can be thought
>   of as having an "implicit monad".
>
>   I saw in the docs that SwingSet has some notion of a "userspace";
>   I wondered what you were doing, and for that matter, how wavy-dot
>   was able to do a similar thing with queueing messages that are only
>   released after vat-turn.

I’m not sure I follow what you are asking about here.  Possibly the turn vs. crank distinction which I discussed above may capture part of the answer.  From the kernel’s perspective, a crank consists of one delivery into a vat plus all the syscalls the vat issues during the crank’s execution.  These syscalls are the means by which the vat sends messages and resolves promises, all of which result in events being placed into the kernel’s message queue.  The next crank is initiated by taking the next thing off the head of the kernel’s message queue and delivering it to whatever vat it’s addressed to.  Since the kernel state (including the state of the message queue) is checkpointed at the end of the crank, a failure that aborts the crank aborts the checkpoint, and it is as if any state changes made to the kernel by syscalls issued during the crank had never happened.  I don’t know if that answers any part of your question, but feel free to ask some more if not.
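
The crank loop described above can be sketched as a toy kernel. All names and data shapes here (`makeKernel`, `runCrank`, `syscall.send`, the `dead` list) are hypothetical, not the real SwingSet interfaces. The sketch assumes one delivery per crank, syscalls that only touch a working copy of the kernel state, and, as a simplification, that a vat whose crank aborts is terminated.

```javascript
// Toy kernel with a single run queue. All names here are hypothetical.
function makeKernel(vats) {
  // State as of the last crank boundary (stands in for the durable checkpoint).
  let committed = { queue: [], dead: [] };

  function runCrank() {
    const [delivery, ...rest] = committed.queue;
    if (!delivery) return false; // nothing left to do
    if (committed.dead.includes(delivery.target)) {
      committed = { ...committed, queue: rest }; // drop mail for dead vats
      return true;
    }
    // Syscalls mutate a working copy, never the committed state.
    const working = { queue: [...rest], dead: [...committed.dead] };
    const syscall = {
      send: (target, method) => working.queue.push({ target, method }),
    };
    try {
      vats[delivery.target](delivery.method, syscall);
      committed = working; // crank succeeded: checkpoint
    } catch (err) {
      // Crank aborted: discard `working`, so its sends never happened.
      // Simplification: the toy then terminates the failing vat.
      committed = { queue: rest, dead: [...committed.dead, delivery.target] };
    }
    return true;
  }

  return {
    enqueue: (target, method) => committed.queue.push({ target, method }),
    run() { while (runCrank()) { /* one crank per iteration */ } },
    state: () => committed,
  };
}

const seen = [];
const kernel = makeKernel({
  alice(method, syscall) {
    seen.push(`alice.${method}`);
    syscall.send('bob', 'hello');
  },
  bob(method) {
    seen.push(`bob.${method}`);
  },
  mallory(method, syscall) {
    syscall.send('bob', 'spam'); // queued in the working copy only
    throw new Error('vat-fatal failure');
  },
});

kernel.enqueue('mallory', 'start');
kernel.enqueue('alice', 'start');
kernel.run();
console.log(seen); // bob receives 'hello' but never 'spam'
```

Because the syscall object is handed to the vat for the duration of the crank, the "which crank am I in" context is carried by that object rather than by a global variable, at least in this toy.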

> 3) For that matter, is there a description of what the swingset
>   nanokernel is, how it is booted up and what the available system
>   calls are?  I'm curious to compare.

Obviously you’ve already found the `docs` directory.  The stuff in there describes a lot of this, though some of it may be slightly out of date (as documentation often is, alas) and of course it’s organized in a somewhat piecemeal fashion rather than as an overall Principles Of Operation style presentation (the latter is something we very much should do, but it’s just not at the front of our priority queue right now).  The other resource, of course, is the code itself.  With respect to system calls, a good place to start would be the file `kernelSyscall.js`

> 4) Is it true that "async/await" are currently discouraged but not yet
>   banned?  I was surprised to see them in the ERTP implementation given
>   all the warnings MarkM has levied against re-entrancy attacks with
>   coroutines?  (Okay I guess this is a more general Agoric question,
>   less SwingSet.)

These are discouraged, but as far as I know there are no plans to ban them.  It’s also the case that we regard them as more problematic in kernel and contract code and much less so in tooling (in particular, there are a lot of unit test things that become extremely challenging to implement without them, given the test framework we’re using (Ava)).  It’s more of a code hygiene thing rather than a question of fundamental semantics.  After much internal debate, we’ve decided that an `await` at the top level of a function (i.e., not inside a loop or conditional) is usually OK, and then we have lint rules that complain about any await that is not at the top level.
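
A small sketch of the difference between the two styles. The function names are hypothetical, and the motivation given here is a plausible reading rather than something stated in this message: with a nested `await`, whether the rest of the function runs in the caller's turn or in a later one depends on runtime data, which makes interleaving harder to reason about.

```javascript
const events = [];

// Discouraged: the `await` is nested inside a conditional.
async function nestedAwait(isCached) {
  if (!isCached) {
    await null; // yields to other turns only on this branch
  }
  events.push(`nested(${isCached})`); // caller's turn, or a later one
}

// Usually OK: the `await` sits at the top level of the function body.
async function topLevelAwait(isCached) {
  await null; // always yields here, whatever the arguments
  events.push(`top(${isCached})`); // always in a later turn
}

nestedAwait(true);
nestedAwait(false);
topLevelAwait(true);
topLevelAwait(false);

// Snapshot of what has already run before the current turn ends.
const ranInCallersTurn = [...events];
console.log(ranInCallersTurn); // only 'nested(true)'
```

With the top-level form, a reader can see the turn boundary by looking at the function body alone, without tracing which branch was taken.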

Hope this was helpful, but feel free to keep tossing questions at us in any case.

— Chip

Christopher Lemmer Webber

Dec 17, 2020, 4:59:24 PM
to cap-...@googlegroups.com, Chip Morningstar
Chip Morningstar writes:

>> On Dec 16, 2020, at 12:30 PM, Christopher Lemmer Webber <cwe...@dustycloud.org> wrote:
>>
>> 1) How does SwingSet currently handle a failure of a vat turn?
>> Are state changes, such as the decrement of an integer value stored
>> under some variable, automatically reset?
>> I would think the right answer would be "yes", but I haven't been
>> able to imagine how it would be done with the present architecture I
>> understand Agoric to be using without either:
>>
>> a) doing a full snapshot each turn, from the VM level
>>
>> b) doing a full snapshot each turn, jhu-paper style
>>
>> c) starting with initial conditions and replaying all messages to
>> return to current state, which would mean that failures would
>> result in quadratic replay behavior, so I cannot imagine this is
>> the case (or if it is, it must be a temporary thing)
>
> It’s a little unclear to me from the context what you mean by “failure
> of a vat turn”.
>
> If you mean “what do you do if a vat fails?”, the answer is: we kill
> the vat. Since vats are deterministic, if it fails once it will
> always fail and so there’s no point to trying to recover; the vat is
> considered to have broken the rules and is terminated as if the turn
> had never happened.

Ah... well, let me be clearer. What I mean is: what happens if an
exception is thrown by code running within the turn?

My impression is that in E, an uncaught exception would not result in
termination of the vat, but all promises which would be resolved with
the return-value of said turn would be broken. Perhaps I am wrong, but
I interpreted a contract/guard/dynamic-type-check error in the makeMint
example in the ode to be more or less throwing an uncaught exception,
and any relevant promises would be broken but the vat would march
onward.

http://erights.org/elib/capability/ode/ode-capabilities.html

It seems that aside from the type annotations, a failure at the
unsealer.unseal(...) part of the code would mean that the turn would be
terminated, but the vat would not be terminated in such a way as to mean
"would not receive future messages". (Otherwise, this code appears to
make the mint very fragile!)

I assumed that the same approach was being taken with javascript here.
Is that wrong?

> If you mean “what do you do if the hosting environment fails somehow?”
> (e.g., the process dies due to a hardware glitch, etc.), the answer is
> (c). As you speculate, this is intended as a temporary thing, but
> perhaps not as temporary as you might hope. The longer term plan is
> (a), but given that our principal target environment is blockchain, we
> need to have the snapshot be part of the consensus state, which
> requires much more careful specification and implementation than if
> all you needed to do was something more like a heap dump. The good
> news is that to a very rough first approximation, such failures don’t
> happen, so the primary use case is spinning up a new validator, which
> typically can afford to spend some time. Also, because of the way we
> log the kernel state and transaction history, vats can be replayed in
> parallel (though this optimization wrinkle is not currently
> implemented because we haven’t needed it yet).

Not what I was asking, but interesting to know.

> Also, a bit of terminology that may be clarifying, especially if you
> are looking at the code: we use the word “turn” to refer to one pass
> through the underlying JavaScript engine’s event loop (what a lot of
> engine implementors call a “microtask”). We use the word “crank” to
> refer to one pass through our higher level event loop, i.e., all of
> the activity that happens in a vat as a consequence of a message
> delivery into that vat, which typically encompasses multiple turns;
> basically, we let the vat run to quiescence, at which point it no
> longer has agency and control returns to the kernel. We use the word
> “block” to refer to a series of cranks that are treated as a unit for
> purposes of resource management and consensus.

That's interesting. I'm not sure I understand what multiple turns in a
crank would look like in practice. Do you have an example?

> The kernel state is persistently committed at crank boundaries (though
> a later, more advanced implementation may choose to actually perform
> the commit at block boundaries to amortize the overhead, with replay
> used to fill in the gaps if there’s a failure, kind of the way
> sophisticated databases use a mixture of roll-forward and roll-back
> strategies) and contains everything needed to reconstruct the swingset
> state as of that moment in time; however, we can’t capture
> idiosyncratic memory state inside the vats (consider, for example,
> closed over variables) without the complicity of the underlying
> JavaScript engine, hence the persistence and failure story above.

I see. Yes it was the closed over variables bit that I was especially
confused about.
Does the kernel run several vat cranks simultaneously, or is it
one-vat-crank-at-a-time?

Let's say I do a wavy-dot-send or something... that ends up in the
queue. Ok. But if multiple vats are running simultaneously, somewhere
there must be something that binds multiple messages to be queued up for
this particular turn/crank, if they are indeed released all at once at
the end. I don't understand how that's being done without some
context-sensitive information if it's multiple-cranks-at-a-time... you
would at least need to make that somehow some "thread exclusive data",
I'd think. But I might just be thinking wrong.

But even the fact that wavy-dot stuff ends up in the queue somewhere
means that somewhere in the system, there is something that is routing
that wavy-dot-sending to a globalish queue I'd think...

>> 3) For that matter, is there a description of what the swingset
>> nanokernel is, how it is booted up and what the available system
>> calls are? I'm curious to compare.
>
> Obviously you’ve already found the `docs` directory. The stuff in
> there describes a lot of this, though some of it may be slightly out
> of date (as documentation often is, alas) and of course it’s organized
> in a somewhat piecemeal fashion rather than as an overall Principles
> Of Operation style presentation (the latter is something we very much
> should do, but it’s just not at the front of our priority queue right
> now). The other resource, of course, is the code itself. With
> respect to system calls, a good place to start would be the file
> `kernelSyscall.js`

Thanks, will take a look.

>> 4) Is it true that "async/await" are currently discouraged but not yet
>> banned? I was surprised to see them in the ERTP implementation given
>> all the warnings MarkM has levied against re-entrancy attacks with
>> coroutines? (Okay I guess this is a more general Agoric question,
>> less SwingSet.)
>
> These are discouraged, but as far as I know there are no plans to ban
> them. It’s also the case that we regard them as more problematic in
> kernel and contract code and much less so in tooling (in particular,
> there are a lot of unit test things that become extremely challenging
> to implement without them, given the test framework we’re using
> (Ava)). It’s more of a code hygiene thing rather than a question of
> fundamental semantics. After much internal debate, we’ve decided that
> an `await` at the top level of a function (i.e., not inside a loop or
> conditional) is usually OK, and then we have lint rules that complain
> about any await that is not at the top level.

Got it.

> Hope this was helpful, but feel free to keep tossing questions at us
> in any case.
>
> — Chip

Very helpful, thank you for taking the time... I know you all are very
busy.

- Chris

Mark S. Miller

Dec 17, 2020, 9:49:21 PM
to cap-...@googlegroups.com, Chip Morningstar
For normal thrown errors, for both E and the Agoric platform, you are correct. A thrown error terminates up the call stack until it gets to a try/catch/finally or the top of the call stack. If it gets to the top of the call stack, then it turns into a rejection of the promise for the result of the turn. The promise for the result of the turn becomes rejected with the thrown error as the `reason` for rejection. That vat then goes on to service the next request in the event loop.
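
As a rough illustration of that behavior in plain JavaScript: the `mint` object and the `eventualSend` helper below are hypothetical stand-ins (not the ode's makeMint code and not Agoric's eventual-send implementation). The helper runs the call in a later turn and returns a promise for its result, so an uncaught throw shows up as a rejection and the next request is still served.

```javascript
// Hypothetical object whose method throws on bad input.
const mint = {
  unseal(box) {
    if (box !== 'sealed-by-this-mint') throw new Error('wrong brand');
    return 'payload';
  },
};

// Stand-in for an eventual send: invoke the method in a later turn and
// return a promise for the result of that turn.
function eventualSend(target, method, ...args) {
  return Promise.resolve().then(() => target[method](...args));
}

const outcomes = [];
const done = Promise.allSettled([
  eventualSend(mint, 'unseal', 'forged'), // throws inside its turn
  eventualSend(mint, 'unseal', 'sealed-by-this-mint'), // still gets served
]).then((results) => {
  for (const r of results) {
    outcomes.push(
      r.status === 'fulfilled' ? r.value : `rejected: ${r.reason.message}`,
    );
  }
  console.log(outcomes);
  return outcomes;
});
```

The first promise is rejected with the thrown error as its reason, while the second call completes normally.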

Christopher Lemmer Webber

Dec 18, 2020, 5:04:19 PM
to cap-...@googlegroups.com, Chip Morningstar, Mark S. Miller
Great, thanks for confirming.

Is it correct that what the swingset nanokernel does with the js vats in
general is to restore the vat from the previous checkpoint (to avoid any
closure-foo type issues)? (Or, is it restored every turn? That would
be kind of wild.)

Chip Morningstar

Dec 31, 2020, 4:30:56 PM
to cap-...@googlegroups.com
Finally circling back to answering more of Chris’ questions...

> I assumed that the same approach was being taken with javascript here.
> Is that wrong?

The answer is mostly the same in JavaScript, except that what happens in the case of an unhandled promise is a little slipperier in the JavaScript world because JavaScript promise semantics are a little slipperier than E’s.  Without getting into the particulars of the mess that various JS platforms have made of unhandled promise handling, the short answer is: mostly just go with your intuitions from E.

> That's interesting. I'm not sure I understand what multiple turns in a
> crank would look like in practice. Do you have an example?

Sure, it’s pretty trivial.  Consider a method `bar` on an object `foo`:

```
  const foo = {
    bar() {
      console.log('before');
      const p = Promise.resolve(47);
      p.then(v => console.log(`it resolved to ${v}`));
      console.log('after');
    },
  };
```

(`Promise.resolve(value)` produces a promise that is born resolved to the value you provide.)

If you (from some other vat) do `foo~.bar()`, the delivery of the `bar` message will initiate a crank that will consist of the execution of the `bar` method in `foo`’s vat, yielding the console output:

```
  before
  after
  it resolved to 47
```

Here we have a single crank consisting of two turns.  In the first turn, the `bar` method runs to completion, producing the first two lines of output.  In the second turn, the resolve handler passed to `then` is invoked, producing the third and final line of output.  After the second turn, there is nothing else to do and the crank ends.

Resolution (or rejection) of a promise is always handled in its own, separate turn in JavaScript.

> Does the kernel run several vat cranks simultaneously, or is it
> one-vat-crank-at-a-time?

Right now the kernel only runs one vat at a time.  This is how it must appear to work, though there is nothing preventing it from actually executing vats concurrently as long as the illusion of sequentiality is maintained.  Doing that would require some extra bookkeeping in how we manage the queues, so for now the easiest way to achieve the appearance of sequentiality is to actually execute them sequentially.  Eventually we’d like to let them execute concurrently, since that would allow us to exploit multiple processor cores, but that’s work for the future.

> Let's say I do a wavy-dot-send or something... that ends up in the
> queue.  Ok.  But if multiple vats are running simultaneously, somewhere
> there must be something that binds multiple messages to be queued up for
> this particular turn/crank, if they are indeed released all at once at
> the end.  I don't understand how that's being done without some
> context-sensitive information if it's multiple-cranks-at-a-time... you
> would at least need to make that somehow some "thread exclusive data",
> I'd think.  But I might just be thinking wrong.

Yes, that’s basically the extra bookkeeping that I was referring to.

> But even the fact that wavy-dot stuff ends up in the queue somewhere
> means that somewhere in the system, there is something that is routing
> that wavy-dot-sending to a globalish queue I'd think...

The current implementation has a single global message queue, and message send syscalls append to it directly.  An implementation supporting concurrently executing vats would collect each vat’s outgoing message sends separately and then stick them on the queue as each vat finishes its crank.  An even more sophisticated implementation might be able to look at the emerging causality graph and do various optimizations by combining and rearranging subgraphs.  I don’t think there’s any real limit to how fancy you could get to gain additional levels of concurrency, but none of that is anything we need to worry about right now.
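
The per-vat collection idea might look roughly like the sketch below. The names (`runQueue`, `runCrankBuffered`, `outbox`) are hypothetical, and the choice to append outboxes in scheduling order is an assumption made here to keep the result deterministic; the message above only says the sends are added as each vat finishes its crank.

```javascript
// Shared kernel run queue (hypothetical shape).
const runQueue = [];

// Each crank gets a private outbox instead of touching the run queue.
function runCrankBuffered(vatName, behavior) {
  const outbox = [];
  const syscall = {
    send: (target, method) => outbox.push({ from: vatName, target, method }),
  };
  behavior(syscall);
  return outbox; // handed back to the kernel when the crank finishes
}

// Two cranks that could, in principle, run on different cores.
const outA = runCrankBuffered('vatA', (s) => {
  s.send('vatC', 'a1');
  s.send('vatC', 'a2');
});
const outB = runCrankBuffered('vatB', (s) => {
  s.send('vatC', 'b1');
});

// Append outboxes in a fixed order (here: the order the cranks were
// scheduled), regardless of which crank happened to finish first.
for (const outbox of [outA, outB]) runQueue.push(...outbox);
console.log(runQueue.map((m) => `${m.from}->${m.target}.${m.method}`));
```

Each vat's sends stay contiguous and in order on the queue, which is what preserves the appearance of sequential execution.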

— Chip

