Akka remote hangs and stops sending heartbeats

Grégory Marti

Jun 21, 2016, 7:51:40 AM
to Akka User List
Hi all,

We have built an application with more than 15 nodes doing distributed concurrent work, all of them controlled by a master node that talks to them through Akka remote.

They run 24 hours a day, and the system worked flawlessly for 2 years.

We are using Akka 2.4.7 deployed on Ubuntu 14.04.4 LTS.

For the last 2-3 months we have been losing the connection between the master and the workers, and it now happens every ~3 days.

I have enabled logging on the controller and on one worker to see what is happening. The result is a bit strange and I cannot explain it.

The controller stops sending deathwatch heartbeats for ~1 min, as if it were frozen or hanging. On the other side, the worker keeps sending heartbeats but receives no answers, so it quarantines the controller. After the freeze, the controller resumes sending heartbeats where it left off.

Edit: I have found that it sometimes hangs for ~6 sec too (which does not cause a problem with deathwatch).

I have checked the network in both directions, and the problem is not network related.

I have dedicated dispatchers (2-3) for nearly every task, but some tasks use the default dispatcher.

From what I know, heartbeats are sent on akka.remote.default-remote-dispatcher and should not be disturbed by the other jobs.

Remote is mostly used for control messages, so there are not many of them, but sometimes there can be "big" messages (akka.remote.netty.tcp.maximum-frame-size = 536870912).

Is it possible that the default remote dispatcher needs some tuning?

Is there anything else that could make heartbeats hang for 1 minute?

Thank you 

Grégory
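
One way to tell a stall of the whole JVM apart from a starved dispatcher is a watchdog thread that sleeps for a fixed interval and logs whenever it wakes up much later than requested. The following is only a minimal sketch under that assumption: nothing in the logs described above confirms a JVM-wide pause (for example a long garbage collection), and PauseDetector, its method names, the 100 ms tick and the 1 s threshold are all hypothetical, not part of Akka.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Akka: reports stalls that affect the whole JVM. */
public class PauseDetector {

    /** Delay beyond the requested sleep, in milliseconds (never negative). */
    public static long overshootMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    /** Sleeps the given number of rounds and returns every overshoot at or above the threshold. */
    public static List<Long> watch(int rounds, long sleepMillis, long thresholdMillis)
            throws InterruptedException {
        List<Long> pauses = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMillis);
            long overshoot = overshootMillis(start, System.nanoTime(), sleepMillis);
            if (overshoot >= thresholdMillis) {
                pauses.add(overshoot);
                System.err.println("JVM stalled for about " + overshoot + " ms");
            }
        }
        return pauses;
    }

    public static void main(String[] args) throws InterruptedException {
        // 100 ms ticks, report stalls of 1 s or more; both values are arbitrary choices.
        watch(Integer.MAX_VALUE, 100, 1000);
    }
}
```

If such a thread, started inside the controller's JVM, reports ~6 s and ~60 s stalls at the same moments the heartbeats stop, the freeze is probably not specific to the remoting dispatcher. If it stays silent during a freeze, the dispatcher or the transport becomes the more likely suspect.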



Patrik Nordwall

Jun 21, 2016, 8:01:46 AM
to akka...@googlegroups.com
What Akka version did you use when you didn't have this problem?
Sending 500 MB messages is not recommended. That will prevent heartbeat messages from getting through. Are the issues correlated with the times when you send those large messages?

/Patrik

--

Patrik Nordwall
Akka Tech Lead
Lightbend -  Reactive apps on the JVM
Twitter: @patriknw

Grégory Marti

Jun 22, 2016, 9:06:35 AM
to Akka User List
Akka 2.4.1 (11.12.2015)

Moved to Akka 2.4.2 (24.03.2016) 

Problems arose between 24.03 and 26.05 (probably end of April), but the system was not fully running before this date

Then Akka 2.4.6 on 26.05.2016 (to try to resolve the problem)

Akka 2.4.7 on 08.06.2016


The problem seems to appear about 3 days after startup, never sooner or later.


> Sending 500 MB messages is not recommended.

Yeah... we will remove the big messages. I think they are not related, because they are triggered by users and we have already had the problem at night.

I will do some tests to see whether those big messages make the heartbeats hang.

Could changing parallelism-min/max of default-remote-dispatcher help with this kind of problem? (I know we need to find the root cause, but a workaround is welcome in the meantime.)
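
For reference, the settings in question would go into application.conf roughly as below. This is only a sketch, assuming the key names from the Akka 2.4 remoting reference configuration; the values are illustrative, not recommendations, and have not been tested against this system.

```
akka.remote {
  default-remote-dispatcher {
    fork-join-executor {
      # Illustrative values: pin the remoting pool to a fixed number of threads
      parallelism-min = 4
      parallelism-max = 4
    }
  }
  watch-failure-detector {
    # Illustrative value: tolerate longer silences before a watched node
    # is considered unreachable (set on the node doing the watching)
    acceptable-heartbeat-pause = 20 s
  }
}
```

More dispatcher threads cannot help if the whole JVM is frozen, and a longer acceptable-heartbeat-pause only hides pauses shorter than the configured value, so neither setting replaces finding the root cause.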

Viktor Klang

Jun 22, 2016, 9:09:00 AM
to Akka User List

Does it still work flawlessly on 2.4.1? (I.e., are we looking at a symptom or a cause?)

--
Cheers,
