Periodic "Disassociated" with remote system

Caoyuan

Aug 10, 2014, 12:08:00 PM

to akka...@googlegroups.com

We have an Akka cluster with 10 nodes. It works almost smoothly, except that it periodically logs a "Disassociated" WARN from which it seems unable to recover.

The log records follow.

......

2014-08-10 00:00:09,253 WARN a.remote.ReliableDeliverySupervisor akka.tcp://Cluste...@10.0.69.169:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.65.3%3A2552-5 - Association with remote system [akka.tcp://Cluste...@10.0.65.3:2552] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

2014-08-10 00:00:44,292 WARN a.remote.ReliableDeliverySupervisor akka.tcp://Cluste...@10.0.69.169:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.65.3%3A2552-5 - Association with remote system [akka.tcp://Cluste...@10.0.65.3:2552] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

2014-08-10 00:01:49,332 WARN a.remote.ReliableDeliverySupervisor akka.tcp://Cluste...@10.0.69.169:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.65.3%3A2552-5 - Association with remote system [akka.tcp://Cluste...@10.0.65.3:2552] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

2014-08-10 00:02:24,373 WARN a.remote.ReliableDeliverySupervisor akka.tcp://Cluste...@10.0.69.169:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.65.3%3A2552-5 - Association with remote system [akka.tcp://Cluste...@10.0.65.3:2552] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

2014-08-10 00:02:59,412 WARN a.remote.ReliableDeliverySupervisor akka.tcp://Cluste...@10.0.69.169:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.65.3%3A2552-5 - Association with remote system [akka.tcp://Cluste...@10.0.65.3:2552] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

2014-08-10 00:03:34,452 WARN a.remote.ReliableDeliverySupervisor akka.tcp://Cluste...@10.0.69.169:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.65.3%3A2552-5 - Association with remote system [akka.tcp://Cluste...@10.0.65.3:2552] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

......

The warning recurred for almost the whole day at intervals of 35 seconds (30 + 5 s) or 65 seconds (30 + 30 + 5 s), which match the settings of akka.remote's transport failure detector:

akka.remote {
  transport-failure-detector {
    heartbeat-interval = 30 s # default 4s
    acceptable-heartbeat-pause = 5 s # default 10s
  }
}

Here, the failure detector marks the remote node unavailable once heartbeat-interval + acceptable-heartbeat-pause (35 s) has passed without a heartbeat.
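As a quick check of the quoted periods, the gaps between consecutive timestamps of the six log lines shown above work out as follows. This only restates arithmetic on the logged times; the roughly 40 ms beyond 35 s or 65 s is not explained anywhere in the thread.

```latex
\begin{align*}
\text{00:00:44.292} - \text{00:00:09.253} &= 35.039\ \text{s} \approx 30 + 5\ \text{s} \\
\text{00:01:49.332} - \text{00:00:44.292} &= 65.040\ \text{s} \approx 30 + 30 + 5\ \text{s} \\
\text{00:02:24.373} - \text{00:01:49.332} &= 35.041\ \text{s} \approx 30 + 5\ \text{s} \\
\text{00:02:59.412} - \text{00:02:24.373} &= 35.039\ \text{s} \approx 30 + 5\ \text{s} \\
\text{00:03:34.452} - \text{00:02:59.412} &= 35.040\ \text{s} \approx 30 + 5\ \text{s}
\end{align*}
```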

We're using akka-2.3.3. The node that logged the warnings is at 10.0.69.169:2551, and the remote node is at 10.0.65.3:2552.

I tried digging through the Akka remoting source code, but made no progress.

Thoughts?

-Caoyuan Deng

Patrik Nordwall

Aug 11, 2014, 6:21:21 AM

to akka...@googlegroups.com

Hi Caoyuan,

Do you see the same thing with Akka version 2.3.4 and changing the transport-failure-detector settings to default?

Regards,

Patrik

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

--

Patrik Nordwall
Typesafe - Reactive apps on the JVM
Twitter: @patriknw

Caoyuan

Aug 11, 2014, 8:39:42 AM

to akka...@googlegroups.com

On Mon, Aug 11, 2014 at 6:21 PM, Patrik Nordwall <patrik....@gmail.com> wrote:

Hi Caoyuan,

Do you see the same thing with Akka version 2.3.4 and changing the transport-failure-detector settings to default?

Will try. But I don't see any relevant source changes between 2.3.3 and 2.3.4.

Caoyuan

Aug 25, 2014, 3:31:59 AM

to akka...@googlegroups.com

Update Aug 25, 2014:

We changed akka.remote.transport-failure-detector.acceptable-heartbeat-pause from 5 s to 10 s, and the WARN messages are gone. I guess the [Disassociated] WARN might have been caused by network delay, a GC pause (a full GC currently lasts 3+ seconds on our system), etc. The setting is now:

akka.remote {
  transport-failure-detector {
    heartbeat-interval = 30 s # default 4s
    acceptable-heartbeat-pause = 10 s # default 10s
  }
}

But that does not explain the periodic "Disassociated" WARN we saw before, where the association seemed unable to recover from the Disassociated state.

Akka Team

Aug 25, 2014, 6:31:15 AM

to Akka User List

Hi Caoyuan,

It is usually dangerous to set acceptable-heartbeat-pause to a smaller value than heartbeat-interval itself. If a single heartbeat gets lost, the next heartbeat will definitely not make the deadline. I recommend setting it to a larger value. Also, I would go with a lower heartbeat-interval setting; 10 s seems more appropriate if you want low heartbeat traffic.
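A minimal sketch of a configuration that follows this advice. The 10 s interval is the value suggested above; the 30 s pause is only an illustrative assumption (the thread does not name a value), chosen to be larger than the interval so that one lost heartbeat does not by itself miss the deadline.

```hocon
akka.remote {
  transport-failure-detector {
    # Value suggested in the reply above for low heartbeat traffic.
    heartbeat-interval = 10 s
    # Assumed value, for illustration only: the advice is just "larger than
    # heartbeat-interval", so tune this for your network delay and GC pauses.
    acceptable-heartbeat-pause = 30 s
  }
}
```

With values like these, the deadline (interval plus pause) spans several heartbeat intervals, so an occasional dropped heartbeat or a pause of a few seconds should not on its own trip the detector.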

-Endre

--

Akka Team

Typesafe - The software stack for applications that scale

Blog: letitcrash.com
Twitter: @akkateam

Caoyuan

Aug 26, 2014, 3:51:45 AM

to akka...@googlegroups.com

Got it now. Thanks.

BTW, our cluster has run for 15 days with 1 million long-lived connections, stable and consistent.

√iktor Ҡlang

Aug 26, 2014, 4:37:12 AM

to Akka User List

On Aug 26, 2014 9:51 AM, "Caoyuan" <dcao...@gmail.com> wrote:

BTW, our cluster has run for 15 days with 1 million long-lived connections, stable and consistent.

Awesome

Roland Kuhn

Aug 26, 2014, 10:10:27 AM

to akka-user

Hi Caoyuan,

On 26 Aug 2014, at 09:51, Caoyuan <dcao...@gmail.com> wrote:

BTW, our cluster has run for 15 days with 1 million long-lived connections, stable and consistent.

That’s great to hear, and it does make me a bit curious about the rest of the story: care to share it privately or even publicly?

Regards,

Roland

Dr. Roland Kuhn
Akka Tech Lead
Typesafe – Reactive apps on the JVM.
twitter: @rolandkuhn

Caoyuan

Aug 26, 2014, 10:38:21 AM

to akka...@googlegroups.com

Hi Roland,

The cluster is based on https://github.com/wandoulabs/spray-socketio. We, Wandou Labs ( http://www.snappea.com/ ), are going to use it for 10+ million persistent connections from mobile devices to our service. These mobile devices can then share status, push messages, fire real-time events, virtually connect to each other, etc.

Feel free to ask more questions :-)

Regards,

Caoyuan Deng ( https://github.com/dcaoyuan )

Roland Kuhn

Aug 27, 2014, 11:20:17 AM

to akka-user

On 26 Aug 2014, at 16:38, Caoyuan <dcao...@gmail.com> wrote:

Hi Roland,

The cluster is based on https://github.com/wandoulabs/spray-socketio

This looks very interesting! In the near-to-mid-term future we will set out to add WebSocket support to Akka HTTP; I guess gathering inspiration will not be forbidden ;-)

We, Wandou Labs ( http://www.snappea.com/ ), are going to use it for 10+ million persistent connections from mobile devices to our service. These mobile devices can then share status, push messages, fire real-time events, virtually connect to each other, etc.

This is a very cool use-case, please let us know how it works out and which obstacles you encounter!