Strategies for detecting thread pool starvation in misbehaving Akka apps

Richard Bradley

May 29, 2015, 7:56:26 AM
to akka...@googlegroups.com
Hi,

I was wondering if there were any strategies or published code for detecting and dealing with thread pool starvation in Akka apps, to save me from duplicating work?

I've seen https://groups.google.com/forum/#!topic/akka-user/JrDWOPeOFy8 but it doesn't directly address detecting this issue when it occurs accidentally.

We have a high throughput Akka app, and we have occasionally encountered problems whereby a bug or a misused library has started blocking Actor threads. At high enough loads this will get to the point that all the worker threads in the thread pool are blocked and no Futures or callbacks get serviced until the next slow operation completes.

This pauses everything, including logging and network i/o, and can be difficult to spot. For example, when this happened in the past we saw only odd network errors and very slow response times, but no logging relating to the thread starvation itself, and no obvious causes in the logs.

I was thinking of adding a sentinel thread to periodically wake up and check on the queue size of the thread pool, and to log a big error with a stack dump if there were no free threads. Is there any published code which I can use to do this, or should I write my own? Is there a better strategy?
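A minimal sketch of the sentinel idea, in plain Java with no Akka dependency. Rather than inspecting the pool's queue size, it submits a no-op task and checks whether the task starts within a deadline, then prints every thread's stack if it does not. The names `StarvationProbe`, `isResponsive`, `dumpAllThreads` and `startSentinel` are hypothetical, and the sketch assumes the pool being watched can be handed over as a `java.util.concurrent.Executor`.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: probes an Executor and reports when a trivial task cannot start in time. */
final class StarvationProbe {
    private StarvationProbe() {}

    /** Returns true if a no-op task submitted to the pool started within maxDelayMillis. */
    static boolean isResponsive(Executor pool, long maxDelayMillis) {
        CountDownLatch started = new CountDownLatch(1);
        pool.execute(started::countDown);
        try {
            return started.await(maxDelayMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Formats the name, state and stack frames of every live thread. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            sb.append(e.getKey().getName()).append(" (").append(e.getKey().getState()).append(")\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Starts a daemon sentinel thread that probes the pool every intervalMillis. */
    static Thread startSentinel(Executor pool, long intervalMillis, long maxDelayMillis) {
        Thread sentinel = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                boolean ok = isResponsive(pool, maxDelayMillis);
                if (Thread.currentThread().isInterrupted()) {
                    return; // interrupted while probing: stop without a spurious report
                }
                if (!ok) {
                    System.err.println("Possible thread pool starvation:\n" + dumpAllThreads());
                }
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    return; // stop quietly when interrupted
                }
            }
        }, "starvation-sentinel");
        sentinel.setDaemon(true);
        sentinel.start();
        return sentinel;
    }
}
```

The sentinel runs on its own thread, so it keeps working even when every worker in the watched pool is blocked. One known weakness of this sketch: each failed probe leaves a no-op task queued on the starved pool.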

I know that IntelliJ has a sentinel thread which watches the UI thread for long pauses and automatically logs a stack dump in that case (see "Automatic thread dumps" at https://intellij-support.jetbrains.com/entries/23348667-Getting-a-thread-dump-when-IDE-hangs-and-doesn-t-respond )
Is there anything similar that already exists for Akka?

(If I do write my own, would it be useful for me to try to create a PR to add this to Akka itself? Should I get a design “signed off” somewhere first to maximise the chances of it being accepted, if so?)

Thanks,


Rich

Guido Medina

May 29, 2015, 8:11:01 AM
to akka...@googlegroups.com
One approach I have used since watching the Akka days videos is to create small dispatchers for different types of tasks. Say you have a set of actors that persist to a relational database: create those actors on your persistor-dispatcher. As another example, say you have a single supervisor whose only job is to forward messages to another set of actors living in your own cache. A PinnedDispatcher is perfect for that, since the supervisor only ever needs one thread and it needs to be fast because all it does is forward messages. Then there is the fine-grained configuration of your remote-dispatcher, etc.

I don't think there is a solve-all-problems approach, but it helps to define dispatchers ahead of time in a shared conf file that every module of your system extends, and to set a project standard for which dispatcher is used for what. That's how I have it at the moment, and I don't have to worry about some SQL insert taking too long, because the actor doing it runs on a dispatcher where it doesn't matter how long it takes.
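The isolation idea can be sketched in plain Java, with two thread pools standing in for two Akka dispatchers. This is only an analogy under stated assumptions: the class `IsolatedPools`, its method names and the pool sizes are hypothetical, and in a real Akka app the pools would be dispatchers defined in the conf file.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical sketch: keeps blocking calls on their own pool, away from the main pool. */
final class IsolatedPools {
    // Main pool for short, non-blocking work (stands in for the default dispatcher).
    private final ExecutorService mainPool = Executors.newFixedThreadPool(2);
    // Dedicated pool for slow or blocking calls (stands in for a "persistor-dispatcher").
    private final ExecutorService blockingPool = Executors.newFixedThreadPool(4);

    /** Runs a slow or blocking call on the dedicated pool. */
    <T> CompletableFuture<T> runBlocking(Supplier<T> slowCall) {
        return CompletableFuture.supplyAsync(slowCall, blockingPool);
    }

    /** Runs quick work on the main pool. */
    <T> CompletableFuture<T> runFast(Supplier<T> quickCall) {
        return CompletableFuture.supplyAsync(quickCall, mainPool);
    }

    /** Stops accepting new work; already submitted tasks still run. */
    void shutdown() {
        mainPool.shutdown();
        blockingPool.shutdown();
    }
}
```

Even if every thread in the blocking pool is stuck, work submitted through `runFast` still gets a thread, which is the property the dedicated dispatchers are meant to give you.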

Hope that helps,

Guido.

Guido Medina

May 29, 2015, 8:21:52 AM
to akka...@googlegroups.com
Following up on my previous post: say you have careful thread management that interrupts a long-running task. I have been there. If you are interrupting something you coded, you will probably refactor your code so that it doesn't need to be interrupted. That leads to the other scenario, where you are trying to interrupt an alien API, and you know the chances of a badly designed API responding to interruption are slim to none, right?

That said, you will have no choice but to isolate such calls on dedicated dispatchers. It all boils down to two things:
  1. Interruption in Java, over which you have no control most of the time.
  2. Thread scheduler fairness, over which you also have no control.
So I strongly believe the best you can do is to "isolate" or "divide and conquer".

Hope that helps,

Guido.

Paulo "JCranky" Siqueira

May 29, 2015, 8:45:47 AM
to akka...@googlegroups.com

I guess you covered 'dealing with starvation' well. How about detecting accidental issues?

[]s,



Guido Medina

May 29, 2015, 9:00:38 AM
to akka...@googlegroups.com
Another approach, which may also help with accidental issues, is to avoid awaiting on messages. This is how I do it, and again it is working like a charm. Say you have the following scenario:
  • Actor A, running on persistor-dispatcher, needs to insert a record into the DB and send the identity of that record, or the whole newly inserted record, to Actor B, which needs it.
  • Actor B sends a message to Actor A at pre-start requesting that record without waiting: just fire-and-forget, including the sender in the message.
  • Actor A answers the message with its newly inserted record.
  • Actor B sets a context timeout of, say, 1 minute, and until it receives the record from Actor A it stashes all messages coming to it.
  • If Actor B receives a timeout message, it deals with it using your own logic: say, keep going until the number of timeouts exceeds some limit, say 5, or just kill the actor because something is wrong with Actor A.
  • Actor B eventually receives the record from Actor A, and at that moment it unstashes all messages and resumes normal processing.
Think of it as a long-running state machine without blocking: you are just reacting to either the eventual message or a timeout. Does that cover all scenarios? No, but I think it is a good approach to avoid blocking and waiting.
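The Actor B side of that scenario can be modelled as a small state machine. The sketch below is plain Java rather than an Akka actor, so the class `WaitingForRecord`, its method names, the string messages and the limit of five timeouts are all assumptions made for illustration. In Akka the same roles would be played by stashing and a receive timeout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical plain-Java model of "Actor B": stashes messages until the record arrives. */
final class WaitingForRecord {
    static final int MAX_TIMEOUTS = 5; // assumption: give up on the fifth timeout

    private final Queue<String> stash = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private String record;   // null until Actor A replies
    private int timeouts;
    private boolean stopped;

    /** Ordinary message: stash while waiting, process once the record is known. */
    void onMessage(String msg) {
        if (stopped) {
            return;
        }
        if (record == null) {
            stash.add(msg);
        } else {
            process(msg);
        }
    }

    /** Reply from Actor A: remember the record and replay everything stashed, in order. */
    void onRecord(String newRecord) {
        if (stopped) {
            return;
        }
        record = newRecord;
        while (!stash.isEmpty()) {
            process(stash.poll());
        }
    }

    /** Timeout tick: count it, and stop once the limit is reached without a record. */
    void onTimeout() {
        if (record == null && ++timeouts >= MAX_TIMEOUTS) {
            stopped = true;
        }
    }

    private void process(String msg) {
        processed.add(record + ":" + msg);
    }

    List<String> processed() {
        return processed;
    }

    boolean isStopped() {
        return stopped;
    }
}
```

No method here ever waits: each one reacts to a single event and returns, so the thread running it is free again immediately.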

Richard Bradley

May 29, 2015, 9:37:08 AM
to akka...@googlegroups.com
> Another approach that may as well deal with accidental issues, is to avoid awaiting on messages,

I think you misunderstand me.

I am aware of all these strategies to manage deliberate blocking, and to avoid blocking in the first place.

I am asking about how Akka should detect the failure case when an app does not correctly manage blocking. In my case, this has happened accidentally due to bugs (in particular: a library which claimed to be non-blocking was blocking under unusual network circumstances), but it could also happen due to poor application design.


Currently, Akka does not detect when this happens, and the application just experiences pauses with no obvious developer-visible diagnosis of the cause or aid in tracking it down.

This is analogous to deadlocks in databases: all good databases will offer their users hints on avoiding deadlocks, but the best databases will detect user-caused deadlocks and emit an error log with diagnostic information that identifies the cause and gives hints on how to fix it.

Patrik Nordwall

May 29, 2015, 1:23:18 PM
to akka...@googlegroups.com
Not exactly what you are asking for, but growing mailbox size can be a symptom of starvation. You might find this utility useful for monitoring that: https://gist.github.com/patriknw/5946678
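To illustrate the general idea of watching queue growth, here is a minimal plain-Java sketch. It is not the code from the gist: the class `MonitoredQueue`, the threshold parameter and the callback are hypothetical, and a real Akka mailbox would have to be wired in through Akka's own mailbox configuration.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

/** Hypothetical queue wrapper that reports when its size reaches a threshold. */
final class MonitoredQueue<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger(); // tracked separately: queue.size() is O(n)
    private final int warnAt;
    private final IntConsumer onTooLarge;

    MonitoredQueue(int warnAt, IntConsumer onTooLarge) {
        this.warnAt = warnAt;
        this.onTooLarge = onTooLarge;
    }

    /** Adds a message and invokes the callback if the size is at or above the threshold. */
    void enqueue(T msg) {
        queue.add(msg);
        int current = size.incrementAndGet();
        if (current >= warnAt) {
            onTooLarge.accept(current);
        }
    }

    /** Removes and returns the oldest message, or null if the queue is empty. */
    T dequeue() {
        T msg = queue.poll();
        if (msg != null) {
            size.decrementAndGet();
        }
        return msg;
    }

    int size() {
        return size.get();
    }
}
```

A callback that logs the size (and perhaps rate-limits itself) would make a growing backlog visible, though a large queue alone does not prove the cause is starvation.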

/Patrik