Supervision and Fault tolerance

Tim Carey-Smith

Apr 13, 2013, 1:01:40 AM

to cellulo...@googlegroups.com

Hi all,

Having fielded many questions about actor supervision, linking and registries, I thought it would be good to unify the discussion of a solution here, rather than spreading it across many issues on the repo.

My plan is to embrace some of the ideas in Akka to make Celluloid more intuitive with all of these subjects.

With regard to supervision and linking, I would propose that we make each actor a supervisor of any actors started within itself.
This would also include adding support for supervisor strategies (one-for-one / all-for-one).
See [3] for the differences between Erlang and Akka with regard to default strategies.
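
To make the two strategies concrete, here is a minimal sketch in plain Ruby. It is not Celluloid or Akka code: `SketchSupervisor`, `start_child` and `child_failed` are hypothetical names invented for illustration, and "restart" is modelled simply as building a fresh object.

```ruby
# Hypothetical sketch: none of these names are Celluloid or Akka APIs.
class SketchSupervisor
  attr_reader :children

  def initialize(strategy)
    @strategy = strategy   # :one_for_one or :all_for_one
    @factories = {}        # name => block that builds a fresh child
    @children = {}         # name => current child instance
  end

  # Starting a child through the supervisor is what makes it supervised.
  def start_child(name, &factory)
    @factories[name] = factory
    @children[name] = factory.call
  end

  # Called when the child registered under `name` has crashed.
  # Returns the names of the children that were restarted.
  def child_failed(name)
    names = @strategy == :all_for_one ? @children.keys : [name]
    names.each { |n| @children[n] = @factories[n].call }
    names
  end
end

one = SketchSupervisor.new(:one_for_one)
one.start_child(:a) { Object.new }
one.start_child(:b) { Object.new }
before_b = one.children[:b]
one.child_failed(:a)                # only :a is replaced
one.children[:b].equal?(before_b)   # => true
```

With `:all_for_one`, the same failure would replace both `:a` and `:b`, which suits siblings that share state and cannot continue independently.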

There would be a root supervisor for managing core service actors (see below) and any user-space actors.
This can be extended to support a filesystem-like registry for actors. [2]

The primary reason for going this direction is to avoid a global namespace for actors.
It would also make the supervision hierarchy more natural (IMHO) by having actor creation determine supervision.
Instead of using supervisors, you just start actors.

These issues are attempting to tackle the problem from two angles, but I believe a deeper change is in order to resolve some of these dependent issues.

https://github.com/celluloid/celluloid/pull/145 - linking by default
https://github.com/celluloid/celluloid/issues/148 - nested supervisors

Here is an idea of how it might appear to an end user: https://gist.github.com/halorgium/5371ea0a410821810d73

This does raise the question of default Celluloid actors.
I believe that we need to make it obvious that Celluloid has a need to run actors to serve as a foundation for normal behaviors like logging and root-supervision.

https://github.com/celluloid/celluloid/issues/158 - Supervisor.root missing feature
https://github.com/celluloid/celluloid/issues/200 - the removal of autostart creates non-determinism in shutdown
https://github.com/celluloid/celluloid/issues/203 - confusion surrounding Threads

I would love to have some feedback about this. I think it is important to make sure that Celluloid promotes well-structured and stable development.

Further reading at:
[1] http://doc.akka.io/docs/akka/2.1.2/general/actor-systems.html
[2] http://doc.akka.io/docs/akka/2.1.2/general/addressing.html
[3] http://doc.akka.io/docs/akka/2.1.2/scala/fault-tolerance.html

Thanks,
Tim

Ben Langfeld

Apr 13, 2013, 11:12:38 AM

to cellulo...@googlegroups.com

As we discussed on IRC, I like all of these ideas.

Personally, my only beef with system actors is that it's difficult to simply terminate all user-space actors (for example between tests), but with registry namespacing that would be easily solved.

As for addressing, I like Akka's approach. I also like the idea of enforcing that every actor has to have a registry entry which is either given a nice name or has a UUID generated.
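
The "nice name or generated UUID" rule could look roughly like the sketch below. `SketchRegistry` and its methods are hypothetical and are not Celluloid's registry API; only `SecureRandom.uuid` is a real Ruby standard-library call.

```ruby
require "securerandom"

# Hypothetical sketch; not Celluloid's registry API.
class SketchRegistry
  def initialize
    @entries = {}
  end

  # Registers an actor under the given name, or under a generated UUID
  # when no name is supplied. Returns the name that was used.
  def register(actor, name = nil)
    name ||= SecureRandom.uuid
    raise ArgumentError, "name already taken: #{name}" if @entries.key?(name)
    @entries[name] = actor
    name
  end

  def [](name)
    @entries[name]
  end
end

registry = SketchRegistry.new
registry.register(:logger_actor, "logger")   # => "logger"
anonymous = registry.register(:some_worker)  # a UUID string
registry[anonymous]                          # => :some_worker
```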

Where do we start?

Regards,
Ben Langfeld

Tim Carey-Smith

Apr 13, 2013, 8:49:59 PM

to cellulo...@googlegroups.com

Well, I mostly want to see if Tony is on board.
This is to get buy-in for pretty large-scale changes.

After that, I would hope to get a prototype branch up and running soon.
I would most likely start with the supervision registry graph and then auto-linking.

Regarding addressing, there is a minor catch-22 issue with startup I found when getting DCell working with blocks again.
https://github.com/celluloid/dcell/pull/40#issuecomment-16293645

Anyway, I'd love to hear from some more people and also get a few more of the APIs fleshed out (a little).

Ciao,
Tim

Tony Arcieri

Apr 14, 2013, 4:07:38 PM

to cellulo...@googlegroups.com

If you want to rewrite the supervisor system Tim, go for it. The existing code is fucked and needs massive improvements.

Tony Arcieri

Tim Carey-Smith

Apr 15, 2013, 5:42:15 AM

to cellulo...@googlegroups.com

Great!

I had a good talk with a CTO of a company here in New Zealand who is using Scala/Akka full-time.

I was able to clarify quite a few concerns and questions I had.
These were relating to routers, restarting, supervisors, actor references and just about everything TBH.

There was a good solution to the worker problem.
And it became obvious that a Router in Akka parlance is simply a load-balancer and so does not do any management of work distribution.
Treating it as a way to hide many actors is a simple enough concept, yet one with large benefits.
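
As a rough illustration of "router as plain load-balancer", the sketch below forwards each message to the next routee in turn and does nothing else. `SketchRoundRobinRouter` is a hypothetical name, round-robin is just one possible policy, and plain arrays stand in for actor mailboxes.

```ruby
# Hypothetical sketch: a router that only picks the next routee.
class SketchRoundRobinRouter
  def initialize(routees)
    @routees = routees
    @next = 0
  end

  # Forwards the message to the next routee in turn. It does not track
  # which routee is busy or how much work each one has queued.
  def route(message)
    routee = @routees[@next % @routees.size]
    @next += 1
    routee << message
    routee
  end
end

mailboxes = [[], [], []]   # plain arrays stand in for actor mailboxes
router = SketchRoundRobinRouter.new(mailboxes)
5.times { |i| router.route("job-#{i}") }
mailboxes # => [["job-0", "job-3"], ["job-1", "job-4"], ["job-2"]]
```

Callers only ever hold the router, so the number of actors behind it can change without affecting them.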

With regard to restarting, an ActorRef (what we call a proxy right now) does not become terminated/dead upon restart.
But if an Actor is stopped and then restarted at a later point (even with the same name/path), the original ActorRef is terminated/dead.
I agree with this and no doubt, Tony, you will too.
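
One way to picture these reference semantics is sketched below, with hypothetical `SketchCell` and `SketchRef` classes that are neither Akka's nor Celluloid's real types. The reference points at a long-lived cell rather than at the actor instance, so a restart swaps the instance while a stop ends the cell for good.

```ruby
# Hypothetical sketch; the class names are illustrative only.
class SketchCell
  attr_reader :instance

  def initialize(&factory)
    @factory = factory
    @instance = factory.call
    @alive = true
  end

  def alive?
    @alive
  end

  # Restart: swap in a fresh instance, the cell itself lives on.
  def restart
    @instance = @factory.call
  end

  # Stop: the cell is finished for good.
  def stop
    @alive = false
    @instance = nil
  end
end

class SketchRef
  def initialize(cell)
    @cell = cell
  end

  def alive?
    @cell.alive?
  end
end

cells = {}
cells["/user/worker"] = SketchCell.new { Object.new }
ref = SketchRef.new(cells["/user/worker"])

cells["/user/worker"].restart
ref.alive?   # => true, the ref survives a restart

cells["/user/worker"].stop
cells["/user/worker"] = SketchCell.new { Object.new }  # same path, new cell
ref.alive?   # => false, the old ref stays dead
```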

As to the default strategy of "restart", I am not sure about this.

Akka has several concepts which we do not have.
One of which is ActorSystem. This is a full collection of system and user actors which is independent of another ActorSystem.
I believe it would be most useful for testing, as we could easily shut down the previous ActorSystem and start a new one with entirely new state (including the EventSystem/logging).

I have a branch which splits out the core actor functionality (receiver/timers/tasks) from the Object-specific code like SyncCall and friends.
I do think that we need to introduce a new concept, ActorContext, to encapsulate the notion of a mailbox, kept separate from the underlying state that is created when an Actor starts and runs for a moment or two.

All in all, I think there is a reasonably solid idea of where to go, at least in my head.

Catch you,
Tim

Ben Langfeld

Apr 15, 2013, 10:17:26 AM

to cellulo...@googlegroups.com

Tim,

Would be awesome if you could get a bunch of details about that direction down in writing so we can divide up work :) I can probably only tackle small things, but it's something at least.

Regards,
Ben Langfeld

Jonathan Rochkind

Apr 15, 2013, 11:09:04 AM

to cellulo...@googlegroups.com, Ben Langfeld

On 4/13/2013 11:12 AM, Ben Langfeld wrote:
> As we discussed on IRC, I like all of these ideas.
>
> Personally, my only beef with system actors is that it's difficult to
> simply terminate all user-space actors (for example between tests), but
> with registry namespacing that would be easily solved.

This kind of registry namespacing might also make it easier to
distinguish between system-created actors and user-space actors in the
"Terminating X actors" message on program exit -- or even (optionally,
in some contexts maybe) suppress the messages for system-created actors
and only show it for user-space actors.

I think that would get rid of a lot of the issues people have with
Celluloid's previous behavior of always outputting that message in
anything that has "require 'celluloid'" in it. Which led Tony to move
to _not_ starting the system actors automatically, unless they are
specifically asked for -- but, while I understand what prompted it (I
was one of the people complaining about the 'Terminating X actors'
message), not starting system actors which are necessary for some
functionality to work seems like a fragile solution.

Tim Carey-Smith

Apr 15, 2013, 6:25:16 PM

to cellulo...@googlegroups.com

Yup, this is exactly right.

The primary reason is to have a DAG to make the shutdown semantics easier to reason about.
If you read the restarting explanation in Akka[1], you'll see it is simply a recursive descent.
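
A minimal sketch of that recursive descent, assuming the supervision graph is a simple tree: each node stops its descendants before itself, so nothing outlives its supervisor. `SketchNode` is a hypothetical class, and stopping children in reverse start order is an arbitrary choice made here.

```ruby
# Hypothetical sketch of depth-first shutdown over a supervision tree.
class SketchNode
  attr_reader :name, :children

  def initialize(name)
    @name = name
    @children = []
  end

  # Creating a child here is what places it under this node.
  def spawn(name)
    child = SketchNode.new(name)
    @children << child
    child
  end

  # Depth-first: stop every descendant, then stop this node.
  # Returns the log of names in the order they were stopped.
  def shutdown(log = [])
    @children.reverse_each { |child| child.shutdown(log) }
    log << @name
  end
end

root = SketchNode.new("root")
pool = root.spawn("pool")
pool.spawn("w1")
pool.spawn("w2")
root.spawn("logger")
root.shutdown # => ["logger", "w2", "w1", "pool", "root"]
```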

The side-benefit is definitely that you have a well-structured and composed app.
It is also possible to duplicate sub-graphs of the actor system.

This reminds me that the naming/pathing registry can mostly be used with relative lookups once you have a sub-graph of actors.
It is easy enough to try and find "../worker-router", thus removing the notion of global lookups. [2]
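
A relative lookup such as "../worker-router" could be resolved along the lines of the sketch below. `SketchPathNode`, `lookup` and the path layout are assumptions made for illustration, not an existing Celluloid or Akka API.

```ruby
# Hypothetical sketch of filesystem-like actor paths.
class SketchPathNode
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = {}
  end

  def spawn(name)
    @children[name] = SketchPathNode.new(name, self)
  end

  def child(name)
    @children[name]
  end

  # Absolute path of this node; the root is "/".
  def path
    return "/" if parent.nil?
    parent.parent.nil? ? "/#{name}" : "#{parent.path}/#{name}"
  end

  # Resolves a relative path such as "../worker-router" from this node.
  # Returns nil when any segment does not exist.
  def lookup(relative)
    relative.split("/").reduce(self) do |node, segment|
      break nil if node.nil?
      case segment
      when "", "." then node
      when ".."    then node.parent
      else node.child(segment)
      end
    end
  end
end

root = SketchPathNode.new("root")
user = root.spawn("user")
router = user.spawn("worker-router")
worker = user.spawn("worker-1")

worker.lookup("../worker-router").path # => "/user/worker-router"
worker.lookup("../missing")            # => nil
```

Because the lookup starts from the calling actor, a whole sub-graph can be duplicated elsewhere in the tree without its internal references changing.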

In Akka, the ActorRef has an ActorContext which in turn has an ActorRefProvider.
There are 3 implementations of ActorRefProvider: Local, Remote and Cluster.

This ties nicely into how we could more simply make DCell behave.
Perhaps an ActorSystem which was DCell enabled, if that makes sense.

Ciao,
Tim

[1] http://doc.akka.io/docs/akka/2.2-M2/general/supervision.html#What_Restarting_Means
[2] http://doc.akka.io/docs/akka/2.2-M2/general/addressing.html#Absolute_vs__Relative_Paths

Tim Carey-Smith

Apr 15, 2013, 6:25:29 PM

to cellulo...@googlegroups.com

Yup, I agree.
I was waiting for a few things to settle in my mind and also to talk with this Akka person.

I am heading up to go camping by the beach today.
I will sketch out a good plan of action.

I think the most critical thing to achieve first is the unification of Actor and Supervisor.
After that, automatically linking actors together shouldn't be too hard.
And then adding strategies for supervision.
Then perhaps, improve the concept of actor pathing.

I will be back in the city Tuesday next week (NZ time), maybe it would be possible to have a Google Hangout?

Ciao,
Tim

AJ Christensen

Apr 15, 2013, 6:26:13 PM

to cellulo...@googlegroups.com

ah snap the ../worker path idea (instead of using a Global) is super dope

--AJ

Ben Langfeld

Apr 15, 2013, 8:54:57 PM

to cellulo...@googlegroups.com

A hangout would be cool, though picking a timezone will be fun! How does Wed 24th, 1500 PST / 1000 NZST / 1900 BRT suit you guys?

Regards,
Ben Langfeld

Tony Arcieri

Apr 15, 2013, 9:14:11 PM

to cellulo...@googlegroups.com

25th would work a lot better for me

--

Tony Arcieri

Tim Carey-Smith

Apr 16, 2013, 2:00:09 AM

to cellulo...@googlegroups.com

That should work for me.
I am planning on heading to more beaches in the afternoon, I think.

http://www.timeanddate.com/worldclock/meetingdetails.html?year=2013&month=4&day=25&hour=22&min=0&sec=0&p1=224&p2=22&p3=213

Does anyone else want to join in?
I think the only TZs which are annoying are in Asia. So apologies!

Will we announce a URL in channel or something?

From a tent somewhere in New Zealand,
Tim

Ben Lovell

Apr 16, 2013, 3:15:35 AM

to cellulo...@googlegroups.com

At this stage it would be easy for me to just link the akka docs and say "all of this, please" ;)

However, the things that I'm missing, in no particular order:

- Routers being a Thing. It would be so neat to have our equivalent of akka routing.

- Fault tolerance strategies. "Let it crash" is mostly working for me but in some instances I'd like to specify a max number of retries and that kind of thing.

- Leading on from the previous point, namespacing for actors rather than the current global namespace.

- A PoisonPill equivalent or just better (graceful) shutdown semantics. I'm having to cleanly shutdown pools of workers and have some shamefully hokey method of polling the pool (and in turn its workers) until they're all idle so I can allow them to work off their mailboxes. I see via Tim, that akka handles this scenario by re-routing an actor's mailbox contents to a system-wide dead letter mailbox. I'm not sure this is so helpful in my scenario but maybe I'm missing something obvious here?
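
The "max number of retries" wish in the list above could be expressed as a restart budget, sketched below. `SketchRestartBudget` and its parameters are hypothetical names; the idea is simply that a supervisor allows at most N restarts within a time window and escalates once the budget is spent.

```ruby
# Hypothetical sketch of a restart budget for a supervisor.
class SketchRestartBudget
  def initialize(max_restarts:, within:)
    @max_restarts = max_restarts
    @within = within   # window length in seconds
    @restarts = []     # timestamps of recent restarts
  end

  # Returns true and records the restart if one is allowed at time `now`.
  # Returns false when the budget is exhausted, meaning the failure
  # should be escalated instead of retried.
  def allow?(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    @restarts.reject! { |t| now - t >= @within }
    return false if @restarts.size >= @max_restarts
    @restarts << now
    true
  end
end

budget = SketchRestartBudget.new(max_restarts: 2, within: 10)
budget.allow?(0)    # => true
budget.allow?(1)    # => true
budget.allow?(2)    # => false, two restarts already in the last 10s
budget.allow?(10.5) # => true, the restart at t=0 has aged out
```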

I'll weigh in where I can.

Cheers,

Ben

Tim Carey-Smith

Apr 16, 2013, 4:51:39 AM

to cellulo...@googlegroups.com

I think your misconception that we don't have a graceful shutdown process is a signal of where the ATOM model is confusing and harder to reason about.

A shutdown in Celluloid is delayed until the next message is handled, so it does not interrupt the current piece of work.
Using a pattern which handles the work running on actors, it is theoretically possible to do nice shutdowns.
As we have seen though, when a method call suspends and another message is allowed to be handled, this can cause the shutdown of an actor to occur while there are still outstanding tasks in progress.

The notion of a PoisonPill is what we have right now, the trouble is that we need to continue to handle messages even once the shutdown has commenced.
This is related to the issue with finalize suspension. https://github.com/celluloid/celluloid/issues/212
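
The simple case described above, where the shutdown request is just another message in the mailbox, can be sketched with a plain Ruby `Queue` and `Thread`. `run_mailbox` is a hypothetical helper and no Celluloid is involved. It does not model the harder case of tasks that suspend part-way, which is where the trouble starts.

```ruby
require "thread"

# Hypothetical sketch: processes messages in order until the pill arrives.
# Note: the caller must include the pill in `messages`, or join never returns.
def run_mailbox(messages, pill)
  mailbox = Queue.new
  handled = []

  worker = Thread.new do
    loop do
      message = mailbox.pop          # blocks until a message is available
      break if message.equal?(pill)  # stop only once the pill is reached
      handled << message
    end
  end

  messages.each { |m| mailbox << m }
  worker.join
  [handled, mailbox.size]
end

pill = Object.new   # sentinel playing the role of a poison pill
handled, left_over = run_mailbox(["job-0", "job-1", "job-2", pill, "late"], pill)
handled   # => ["job-0", "job-1", "job-2"]
left_over # => 1, the message queued after the pill is never handled
```

Everything queued ahead of the pill is worked off, and anything behind it is left in the mailbox, which is the kind of message a dead-letter mailbox would collect.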

Basically, we don't have a good way to do this without waiting for all tasks.
This was thought of in https://github.com/celluloid/celluloid/issues/197

With regard to dead-letters, my understanding of the Akka routers was flawed.
They do not handle direct allocation of work to actors. Instead they do allow the mailboxes to contain many messages and attempt to handle back-pressure in a small way.

I talked with someone who has implemented a solution for this problem in Akka.
My example for this is at https://github.com/halorgium/celluloid-coordinator/blob/master/demo.rb

Hope this makes sense,
Tim

Tim Carey-Smith

Apr 22, 2013, 6:33:45 AM

to cellulo...@googlegroups.com

Hey all,

I have documented some of my thoughts on a wiki page.
https://github.com/celluloid/celluloid/wiki/Overhaul-Plan

They are in a somewhat chronological order so as to make the changes as easy as possible.

I thought I would also confirm that I should be in internet range for our hangout this week!

Looking forward to getting started on this,
Tim

Tony Arcieri

Apr 22, 2013, 5:14:06 PM

to cellulo...@googlegroups.com

+1 on all of that

Ben Langfeld

Apr 25, 2013, 7:53:48 PM

to cellulo...@googlegroups.com

We had a hangout session about this stuff and other things, and there's a video available at https://www.youtube.com/watch?v=-n-67dXegNI for anyone who is interested.

Regards,
Ben Langfeld
