Working on thread pooling for Actors


li...@chuckremes.com

Apr 16, 2017, 1:23:13 PM
to cellulo...@googlegroups.com
Sorry that this post is long. :)

So I’ve started poking at the thread pooling code for Actors again. It’s long been clear that celluloid spins up too many threads so context switching overhead starts to dominate when there are thousands of actors. This has been a problem for me on several projects.

Anyway, while investigating I of course started by looking at Celluloid::Group::Pool. The bit rot is deep there. Comparing it to Group::Spawner it’s clear that the dependent code is expecting a thread to be returned. However, Group::Pool pushes a block onto a queue and returns that thread. Nowhere do I see that queue ever getting processed. The work is still done (via yield) but we keep a copy of every block in that unprocessed queue. The warnings about not using this code are correct!

But it turns out that isn’t the right place to start working on this problem anyway. Most of the threads created during runtime seem to come from Celluloid::Actor#start. This method gets a new thread and then runs a loop to read from the Actor’s mailbox. This thread NEVER ENDS unless Actor#terminate is called, an exception is raised, or a TerminationRequest message is processed. So if you have 10k actors, there will be 10k threads just to read the mailboxes. This doesn’t account for threads needed to actually run your business logic.
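For illustration, the shape of the problem is roughly this (a simplified sketch, not the actual Celluloid source; class and method names are mine):

class SketchActor
  def initialize
    @mailbox = Queue.new
    # One dedicated thread per actor, alive for the actor's entire lifetime:
    @thread = Thread.new { run }
  end

  def run
    loop do
      message = @mailbox.pop           # blocks until a message arrives
      break if message == :terminate   # only ends on termination or exception
      handle(message)                  # business logic (or a Task) runs here
    end
  end

  def handle(message)
  end
end

Spin up 10,000 of these and you have 10,000 parked threads before any real work happens.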

So I’m here to ask a question about next steps. In my opinion it makes sense to correct this problem by modifying Celluloid::Mailbox. (There is an alternate evented mailbox in the codebase but it’s apparently for use with Celluloid::IO & Celluloid::ZMQ.)

Right now this class wraps an array in a mutex and uses a condition variable to signal the thread (created by the Actor) to read the next message off the mailbox queue. I have a few ideas that might improve this.

1. Why not use a Fiber for the Actor’s mailbox loop that will yield when the queue is empty and resume when a message arrives? Is it because JRuby backs each Fiber with a dedicated Thread anyway so switching to Fibers doesn’t reduce resource usage? Maybe this solution is only reasonable on MRI which still has real green threads.

2. Create a few dedicated threads that process all Actor mailboxes. When events are found, push those tasks off to another thread pool for execution (which is what Task is for I believe). The mailbox thread(s) would sleep if all mailboxes are empty. Adding events to any mailbox or expiring timers would wake the thread(s). (A rough sketch of this idea follows below.)
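Something like the following is what I have in mind for idea 2 (all names are hypothetical, and it collapses the mailbox-reader threads and the execution pool into a single queue for brevity):

class MailboxDispatcher
  def initialize(workers: 4)
    @ready = Queue.new                  # mailboxes that have pending messages
    @pool  = Array.new(workers) do
      Thread.new do
        loop do
          actor, mailbox = @ready.pop   # sleeps while all mailboxes are empty
          actor.handle(mailbox.pop)     # run the business logic for one message
        end
      end
    end
  end

  # Called from Mailbox#<< (and timer expiry) whenever work appears
  def signal(actor, mailbox)
    @ready << [actor, mailbox]
  end
end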

Have these ideas been tried and had a fatal flaw that couldn’t be solved?

Tony Arcieri

Apr 16, 2017, 1:32:40 PM
to cellulo...@googlegroups.com
On Sun, Apr 16, 2017 at 10:23 AM, <li...@chuckremes.com> wrote:
1. Why not use a Fiber for the Actor’s mailbox loop that will yield when the queue is empty and resume when a message arrives? Is it because JRuby backs each Fiber with a dedicated Thread anyway so switching to Fibers doesn’t reduce resource usage? Maybe this solution is only reasonable on MRI which still has real green threads.

Ideally task fibers could be scheduled across a thread pool, providing a sort of M:N userspace threading model. Unfortunately MRI does not support resuming a fiber in a different thread than the one it was created in.
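For anyone who hasn't run into this, the limitation is easy to demonstrate in plain MRI (the exact error message varies by Ruby version):

fiber = Fiber.new { Fiber.yield :suspended }
fiber.resume                  # => :suspended, fine on the creating thread

Thread.new do
  begin
    fiber.resume              # MRI raises FiberError here
  rescue FiberError => e
    puts e.message            # something like "fiber called across threads"
  end
end.join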

Several years ago I talked to ko1 and several others about supporting this. It would require some guarantees MRI does not provide now but which would be beneficial, IMO, e.g. that you're not allowed to suspend a fiber while holding a mutex.

It's unlikely to happen though.

--
Tony Arcieri

li...@chuckremes.com

Apr 17, 2017, 11:56:24 AM
to cellulo...@googlegroups.com
That’s an unfortunate limitation. 

I’m going to try and spike out my idea #2 where a few dedicated threads handle mailbox processing for all actors. I’ll let the list know how it goes.


li...@chuckremes.com

Apr 21, 2017, 1:59:03 PM
to cellulo...@googlegroups.com
I’ve spent some time this week spiking out an idea to have all mailboxes processed by a pool.

First order of business was getting Celluloid::Group::Pool working correctly. Took about 3 hours but got that done and working. Some new specs too. See those changes on my fork on the pool-mailbox branch here:
https://github.com/chuckremes/celluloid/blob/pool-mailbox/lib/celluloid/group/pool.rb

Next I modified (really just subclassed) Celluloid::Actor::System, Celluloid::Actor, and Celluloid::Mailbox. I made “Pooled” variants of these classes and overrode a few methods in the subclasses to have different behavior. (None of this code has been pushed to GitHub).

The main change was replacing Actor#start, which begins a near-infinite loop to process all incoming mailbox requests. The subclass has a new #start method which just sets things up, plus a #run_once method. When the actor is created, it registers itself (both the mailbox and the actor) with my System subclass, which maps the mailbox.address as a key to the actor reference.

Now when some code adds to the mailbox (e.g. some_actor.mailbox << message), the #<< method passes the mailbox address to System and asks it to schedule a block to run from the pool.

e.g.

def schedule_mailbox(mailbox)
  # Look up the actor that registered this mailbox's address
  address = nil
  actor   = nil
  @mutex.synchronize do
    address = mailbox.address
    actor   = @map[address]
  end

  # Hand the actual mailbox processing off to a pooled thread
  get_thread do
    actor.run_once
  end
end

This lets the actor process its mailbox and handle any waiting messages. The default/original behavior is that #<< uses a condition variable to signal the actor’s runloop to wake up and process the mailbox. So, pretty simple.
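For completeness, the mailbox side of the hand-off is conceptually just this (a sketch, not my actual spike code; the @system back-reference is illustrative):

require 'celluloid'

class Celluloid::Mailbox::Pooled < Celluloid::Mailbox
  attr_writer :system               # back-reference set at registration (illustrative)

  def <<(message)
    super                           # enqueue the message as usual
    @system.schedule_mailbox(self)  # wake a pooled thread instead of signalling a condvar
    message
  end
end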

But I have now run into an architectural issue that I do not understand how to solve. So I need help from the community. Hopefully someone here can help me get over the hump. Here’s the problem. The Celluloid module has a method called #mailbox. See here:

https://github.com/celluloid/celluloid/blob/master/lib/celluloid.rb#L99

This method lazily creates a Mailbox for whatever thread is currently running that doesn’t have one. It turns out that sometimes there are running threads that are NOT actors but the thread still has a mailbox. I do not understand why!
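Paraphrasing the linked method, it is essentially a lazy thread-local initializer, something like:

def self.mailbox
  Thread.current[:celluloid_mailbox] ||= Celluloid::Mailbox.new
end

(The exact thread-local key may differ; the point is that *any* thread, actor or not, gets a mailbox on first use.)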

This method is actually called from Celluloid::Call::Sync#method_missing and a few other similar *critical* locations. When I replace Mailbox with my Mailbox::Pooled subclass here, there is no actor set for the thread via Thread.current[:celluloid_actor]. Therefore, when I try to map the mailbox address to an actor and schedule that actor to run, the lookup returns nil. If I skip scheduling when it's nil, the mailbox from Celluloid.mailbox is never processed.

If I leave that Mailbox in its original form (condvars for signaling) then the whole thing blocks.

What is Celluloid.mailbox and why does it exist? Why are mailboxes outside of Actors allowed?

Tony Arcieri

Apr 21, 2017, 2:07:05 PM
to cellulo...@googlegroups.com
On Fri, Apr 21, 2017 at 10:59 AM, <li...@chuckremes.com> wrote:
This method lazily creates a Mailbox for whatever thread is currently running that doesn’t have one. It turns out that sometimes there are running threads that are NOT actors but the thread still has a mailbox. I do not understand why!

Because if it didn't, it wouldn't be possible for non-actor code to call into actors 

li...@chuckremes.com

Apr 21, 2017, 2:36:52 PM
to cellulo...@googlegroups.com
Ah, crap… yeah, that makes sense.

li...@chuckremes.com

Apr 21, 2017, 5:17:31 PM
to cellulo...@googlegroups.com
So I don’t think this approach will work.

The problem as I see it is that while celluloid lets us build asynchronous communications, some of its internals are most definitely synchronous. For example:

https://github.com/celluloid/celluloid/blob/master/lib/celluloid/call/sync.rb#L51

This will *never* exit while it waits for someone to push a message that responds to :call onto the mailbox.
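Paraphrased, the wait boils down to something like this stand-in (not the real Mailbox#receive, just its shape):

def wait_for_response(mailbox, call)
  loop do
    message = mailbox.pop      # blocks indefinitely; there is no timeout
    return message if message.respond_to?(:call) && message.call == call
    # (the real mailbox keeps non-matching messages around; this sketch ignores that detail)
  end
end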

Normally that wouldn’t be a problem but Actors talk to themselves through the mailbox too. So you end up with a situation where a suspended task gets resumed multiple times from multiple places. Crash.

I’m not smart enough to figure this out today. If anyone has any bright ideas on an approach that might work, I’d love to hear them.

li...@chuckremes.com

Apr 25, 2017, 4:18:37 PM
to cellulo...@googlegroups.com
I’ve continued to work on this. So far I have found multiple ways that do not work. I haven’t found one that works yet. :)

So, a few lessons learned… I rarely use Fibers, but it became clear that Fiber-backed tasks are unworkable with a thread pool: Ruby does not allow fibers to migrate across threads. With Task::Fibered there would need to be a mechanism to pin each actor to a single thread (since its fibers must stay on the thread that spawned them), and that pinning leads to deadlocks. So Task::Threaded is a requirement.

Second, Mailbox#receive blocks indefinitely (no timeout) for many operations. This too can lead to deadlock where another part of the system has pushed work onto that thread’s queue that needs to be completed before the #receive will return. I added a special call to Mailbox#receive that would migrate all queued work from that thread to a new thread and removed the active thread from the pool so no new work would get added to it. That would allow the thread to block indefinitely and also allow other work to continue and advance the state of the system.
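Roughly, in code, the trick looked like this (the pool methods here are illustrative, not the spike's actual API):

def blocking_receive(pool, mailbox, &block)
  pool.remove(Thread.current)             # stop new work from landing on this thread
  orphaned = pool.drain(Thread.current)   # take everything already queued here
  pool.spawn_thread(orphaned)             # let that work advance on a fresh thread
  mailbox.receive(&block)                 # now this thread can safely block forever
end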

Unfortunately, Task::Threaded complicates this. Tasks can also block indefinitely. So I tried the same trick where, before blocking on a condition variable, the thread moves enqueued work elsewhere and then blocks. This too leads to a deadlock situation. I can’t quite figure out *why* since in the degenerate case every work unit has its own thread (like using Group::Spawner). There must be some edge case I am missing.

I did try separating the mailbox threads and the task threads into their own pools. No joy.

At the moment I am out of ideas on how to proceed. I’ll rubber duck this some more and come up with some new approaches.

Tony Arcieri

Apr 25, 2017, 4:45:45 PM
to cellulo...@googlegroups.com
Sounds like you're at least bumping into the constraints Celluloid was designed under and hopefully now understand why it's the way it is.

I have definitely wanted a better scheduling model for Celluloid, and repeatedly talked to MRI core devs about Ruby features which would improve the status quo. Unfortunately, I don't think any of the changes that would enable a better scheduling model are forthcoming, and the current design, however far from ideal, is something of a local maximum.

--
Tony Arcieri

li...@chuckremes.com

Apr 25, 2017, 6:30:26 PM
to cellulo...@googlegroups.com

> On Apr 25, 2017, at 3:45 PM, Tony Arcieri <bas...@gmail.com> wrote:
>
> Sounds like you're at least bumping into the constraints Celluloid was designed under and hopefully now understand why it's the way it is.
>
> I have definitely wanted a better scheduling model for Celluloid, and repeatedly talked to MRI core devs about Ruby features which would improve the status quo. Unfortunately, I don't think any of the changes that would enable a better scheduling model are forthcoming, and the current design, however far from ideal, is something of a local maximum.

Yes, I think so.

I may be wrong, but I believe the design is limited by two choices.

1. Pipelining

2. Interaction with non-actors

ATOM mode is the default. For those following along, read this page:

https://github.com/celluloid/celluloid/wiki/Pipelining-and-execution-modes

An actor creates a task for every incoming message. When the message leads to some blocking IO behavior (or a synchronous call to another actor) the task is suspended and another task is allowed to run. Currently this defaults to using fibers for the task suspension/resumption. Note my complaint earlier in this thread about fibers not being able to execute on any thread other than the one that created them. Pipelining and the lack of migratory fibers make thread pooling pretty much impossible (well, possible but infeasible).

The nice part of this is that using fibers means we do not need to use mutexes. Fibers / cooperative multitasking makes it easier to design the program flow so that we can avoid lots of locking. Note that when replacing Task::Fibered with Task::Threaded the implementation requires several mutexes, protected queues, and condition variables. This stuff ain’t easy to do correctly. :)
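A plain-Fiber example of why the fibered path needs no locks (this is just Ruby's Fiber, not Celluloid's Task::Fibered): everything runs on one thread and only switches at explicit yield/resume points, so there is no preemption to guard against.

task = Fiber.new do
  puts "task: waiting on a 'blocking' call"
  response = Fiber.yield :suspended     # suspend until the actor loop resumes us
  puts "task: resumed with #{response.inspect}"
end

puts task.resume                        # => suspended; the actor loop regains control
# ... the actor loop could process other mailbox messages here ...
task.resume(:response)                  # hand the response back; the task completes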

Next, letting non-actors interact with actors in a synchronous way necessitates a per-thread mailbox (the Celluloid.mailbox discussed above) for communicating with the actor. This requires locks and condition variables for signaling.

As soon as we allow work units to migrate between threads, the rate of deadlock explodes. I have not figured out how to resolve this. Not sure I can.

If all actors worked in exclusive mode by default, an event-driven mailbox would be more feasible. But, we would lose some of the nice things that celluloid gives us. I really like the (default) synchronous-looking method calls between actors. It makes the code flow easier to understand.
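For those following along, Celluloid can already opt out of pipelining today; if I remember the API right, an exclusive block (or marking methods exclusive at the class level) forces one-at-a-time execution inside the actor:

require 'celluloid'

class SerialCounter
  include Celluloid

  def increment
    exclusive do
      # no other task in this actor can interleave while this block runs
      @count = (@count || 0) + 1
    end
  end
end

The question is really about what the *default* should be, not whether exclusive execution is possible at all.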

Sigh. Trade offs.

Tony Arcieri

Apr 25, 2017, 6:35:13 PM
to cellulo...@googlegroups.com
On Tue, Apr 25, 2017 at 3:30 PM, <li...@chuckremes.com> wrote:
If all actors worked in exclusive mode by default, an event-driven mailbox would be more feasible. But, we would lose some of the nice things that celluloid gives us.

You are pretty much describing Celluloid 0.1 here.

--
Tony Arcieri

li...@chuckremes.com

Apr 25, 2017, 8:22:09 PM
to cellulo...@googlegroups.com
I was a cool.io and revactor user at some point in the distant past. Each of your attempts at async I/O got better. I like celluloid the best.

So, why did you move away from the event-driven mailbox design in 0.1 to what we have today? If I understand the tradeoffs maybe we can find a happy compromise that is less resource intensive (fewer threads & locks) while keeping programmer happiness (every/after blocks, synch calls, etc). What must-have feature(s) got us to where we are today?

Tony Arcieri

Apr 26, 2017, 1:37:31 PM
to cellulo...@googlegroups.com
On Tue, Apr 25, 2017 at 5:22 PM, <li...@chuckremes.com> wrote:
So, why did you move away from the event-driven mailbox design in 0.1 to what we have today? If I understand the tradeoffs maybe we can find a happy compromise that is less resource intensive (fewer threads & locks) while keeping programmer happiness (every/after blocks, synch calls, etc). What must-have feature(s) got us to where we are today?

Because it was exceedingly easy for actor "RPCs" to reenter the object, so much so that Celluloid felt useless. ATOM mode originally felt like a beautiful solution to this problem, and felt like "Revactor enabling" an actor.

--
Tony Arcieri

li...@chuckremes.com

Apr 26, 2017, 4:51:14 PM
to cellulo...@googlegroups.com
Thanks for sharing some of the thinking behind this. The rest of this message is In My Humble Opinion, so I hope no one reading gets pissed by it.

Your earlier response coupled with this one ("ATOM mode originally felt like a beautiful solution") worried me. It made me wonder whether I understood the Actor model properly. So, I went internet spelunking and found a great video of Carl Hewitt explaining Actors. See it here:


I took some notes from watching this ~40m video. I reproduce them below (some with time stamps). But I want to draw attention to a few things here. These are all consistent with what I learned about actors Long Ago.

* ATOM / pipeline mode only makes sense if the Actor is purely functional (referential transparency).
* There is NO REQUIREMENT that Actors pipeline messages. They may act on them one at a time without violating the axiom.
* Actor model does not care about message ordering. Great example in video about the indeterminacy of a bank balance and how it changes depending on the order of messages it receives. For message ordering, implement an actor to enforce FIFO/sequencing (a tiny sketch of that follows this list).
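To make that last point concrete, a sequencer is just an actor that owns a queue and forwards strictly one message at a time (plain Ruby sketch, not Celluloid):

class Sequencer
  def initialize(downstream)
    @downstream = downstream               # anything that responds to #<<
    @queue = Queue.new
    @thread = Thread.new do
      loop { @downstream << @queue.pop }   # strict FIFO hand-off, one at a time
    end
  end

  def <<(message)
    @queue << message
  end
end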

What I get from the above data points is that celluloid made an incorrect choice in preferring ATOM mode by default. IMHO, most programmers are not creating purely functional actors. Instead, they are using actors to enforce serialized access to a mutable property. Therefore, pipelining actually works against their goal. Secondly, most programmers will want to enforce ordering on those messages so the Actor mailbox should act FIFO. (Celluloid mostly allows this except for high-priority SystemEvent messages.)

Anyway, pipelining is an implementation detail that I think works against the goals of a lot of programmers looking to simplify parallelizing their work. And unfortunately, from an engineering and design standpoint, it also greatly complicated the inner workings of celluloid.

I’ve been dreaming about celluloid classes and event flows lately. Or should I say, I’ve had nightmares about it. :)

————

Actor Axiom

An Actor is defined thusly...
When actor receives a message it can:
  * create new actors
  * send messages to actors for which it has addresses
  * say/designate what it will do with next message it receives
    ** state change?
  
  Pipelining makes sense when next message will be processed exactly
  the same as the last message (each message has no side effect on
  actor state). Therefore, it only makes sense in a "functional
  programming" sense (referential transparency).
  
  There is no requirement that actors pipeline. One-at-a-time processing
  is acceptable.
  
  Futures are Actors. Can return values or exceptions.
  
  (7:20) Put future into a message to yourself, and now you don't deadlock.
  
  One address can be used for multiple actors.
  Also, one actor may have many addresses (that potentially forward to
  each other).
  
  (11:40) When actors interact over a network, you encrypt your address and send
  it out. If a reply comes back and doesn't decrypt to a valid actor
  address then it should be rejected.
  
  Messages sent between actors are "best effort" and messages may be
  lost. However, between machines they should ack messages. Messages are
  delivered at most once (0 or 1 times), no dupes.
  
  Can use channels with put & get, but that isn't part of the Actor
  model. A channel is actually another actor. Enforcing message ordering
  can be handled by a Queue/Sequencer actor.
  
  (18:40) Actors are indeterminate.
  
  (24:00) Actors process one message at a time but it is INDETERMINATE which
  message will arrive next. For example, let's say a checking account
  has $5. From different parts of the planet we send messages to
  withdraw $4 and $3. If the $4 message arrives first, the balance
  reduces to $1 and the $3 message raises an exception. Conversely,
  if the $3 message arrived first, the balance would be $2 and the
  $4 message would raise an exception.


Tony Arcieri

Apr 26, 2017, 6:09:59 PM
to cellulo...@googlegroups.com
On Wed, Apr 26, 2017 at 1:51 PM, <li...@chuckremes.com> wrote:
What I get from the above data points is that celluloid made an incorrect choice in preferring ATOM mode by default.

For what it's worth I agree, but apparently for different reasons
 
IMHO, most programmers are not creating purely functional actors. Instead, they are using actors to enforce serialized access to a mutable property.

Access is still serialized; however, like any other event loop-based system, it's concurrently accessed mutable state, shared among tasks that run sequentially.

There are POLS (principle of least surprise) problems either way: either state mutations are confusing because "await" points are not explicit, or certain actor interactions result in a nondeterministic crash if call graphs ever contain cycles. Pick your poison.

If I had it all to do over again, I'd use an explicit async/await model like has come into vogue in the years since Celluloid was released. This gives you the best of all worlds. Barring that, yes exclusive mode probably should've been the default.

--
Tony Arcieri

li...@chuckremes.com

May 3, 2017, 9:52:40 AM
to cellulo...@googlegroups.com
Tony,

I had a chat with Charlie/headius about Ruby Fibers and how they can’t move between threads. He thought this would be relatively easy to add to JRuby. He even mused that it was a useful property and he was surprised that MRI hadn’t adopted that yet.

Anyway, we talked more generally about celluloid and he was open to making additions that would be helpful to the library. He is very active with ruby-core and has had great success in getting MRI to adopt changes that he made in JRuby first.

It might be worth your time to chat with him on this topic if you have specific proposals for improving Thread and Fiber. He said he’s available in the jruby channel on freenode or on Gitter.

Keep us (this list) in the loop if you do end up talking with him.