Remoting issue - unwanted load-induced disconnects

171 views
Skip to first unread message

oxbow_lakes

unread,
Nov 27, 2013, 8:41:34 AM11/27/13
to akka...@googlegroups.com
Following on from the discussion in this thread: https://groups.google.com/forum/#!topic/akka-user/RKf5pxGpXQU where I was asked for more detail on why a server and client were disconnecting from each other during a GC, I have made a few changes. Both server *and* client are running with these config properties:

    remote {
        log-remote-lifecycle-events = on
        gate-invalid-addresses-for = 5 s

        enabled-transports = ["akka.remote.netty.tcp"]
        netty.tcp {
            port = 0 (0 = client, 2556 server)
            maximum-frame-size = 100 MiB
        }
        watch-failure-detector {
            acceptable-heartbeat-pause = 20 s
            heartbeat-interval = 5 s
        }
        transport-failure-detector {
            acceptable-heartbeat-pause = 12 s
            heartbeat-interval = 3 s
        }
    }

Following some tuning of the server, I seem to have got the GC times down to around 5s. However, my client and server are still disconnecting - I've provided the relevant messages from client and server (attached) and am interested in your opinion on what is going on:

At 22:05 each night, the server "rolls" over to the next trading day. This will result in a deluge of messages - probably around 80,000 or more per-client (I have ~10 clients) in a period of a few seconds. I think the serialized size of each message is around 500 bytes.

What is very clear is that, at the same time, both client *and* server believed they were sending messages to the other but not receiving them. 

the client:

26-Nov-2013 22:05:10:331: [akka.akka://papagui/system/remote-watcher - FINE] [54]: Received heartbeat rsp from [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556]
26-Nov-2013 22:05:15:062: [akka.akka://papagui/system/remote-watcher - FINE] [28]: Sending Heartbeat to [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556]
26-Nov-2013 22:05:20:059: [akka.akka://papagui/system/remote-watcher - FINE] [46]: Sending Heartbeat to [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556]
26-Nov-2013 22:05:25:063: [akka.akka://papagui/system/remote-watcher - FINE] [39]: Sending Heartbeat to [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556]
26-Nov-2013 22:05:30:048: [akka.akka://papagui/system/remote-watcher - FINE] [52]: Sending Heartbeat to [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556]
26-Nov-2013 22:05:35:049: [akka.akka://papagui/system/remote-watcher - FINE] [50]: Sending Heartbeat to [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556]


the server:

[DEBUG] [11/26/2013 22:05:11.160] [gekkoRemoting-akka.actor.default-dispatcher-32] [akka://gekkoRemoting/system/remote-watcher] Received heartbeat rsp from [akka.tcp://pap...@10.210.51.121:51115]
[DEBUG] [11/26/2013 22:05:15.765] [gekkoRemoting-akka.actor.default-dispatcher-33] [akka://gekkoRemoting/system/remote-watcher] Sending Heartbeat to [akka.tcp://pap...@10.210.51.121:51115]
[DEBUG] [11/26/2013 22:05:20.777] [gekkoRemoting-akka.actor.default-dispatcher-68] [akka://gekkoRemoting/system/remote-watcher] Sending Heartbeat to [akka.tcp://pap...@10.210.51.121:51115]
[DEBUG] [11/26/2013 22:05:25.774] [gekkoRemoting-akka.actor.default-dispatcher-2] [akka://gekkoRemoting/system/remote-watcher] Sending Heartbeat to [akka.tcp://pap...@10.210.51.121:51115]
[DEBUG] [11/26/2013 22:05:30.767] [gekkoRemoting-akka.actor.default-dispatcher-89] [akka://gekkoRemoting/system/remote-watcher] Sending Heartbeat to [akka.tcp://pap...@10.210.51.121:51115]
[DEBUG] [11/26/2013 22:05:35.775] [gekkoRemoting-akka.actor.default-dispatcher-95] [akka://gekkoRemoting/system/remote-watcher] Sending Heartbeat to [akka.tcp://pap...@10.210.51.121:51115]

they then decided each other were not reachable (as you might expect):


the client:

26-Nov-2013 22:05:36:052: [akka.akka://papagui/system/remote-watcher - WARNING] [38]: Detected unreachable: [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556]
26-Nov-2013 22:05:36:053: [akka.Remoting - INFO] [40]: Address [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556] is now quarantined, all messages to this address will be delivered to dead letters.
26-Nov-2013 22:05:36:161: [akka.akka://papagui/system/remote-watcher - FINE] [38]: Watchee terminated: [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556/user/repl]
26-Nov-2013 22:05:36:177: [akka.akka://papagui/system/remote-watcher - FINE] [38]: Unwatched last watchee of node: [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556]

the server:

[DEBUG] [11/26/2013 22:05:36.528] [gekkoRemoting-akka.actor.default-dispatcher-33] [akka://gekkoRemoting/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fpapagui%4010.210.51.121%3A51115-64/endpointWriter] Disassociated [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556] <- [akka.tcp://pap...@10.210.51.121:51115]
[INFO] [11/26/2013 22:05:36.707] [gekkoRemoting-akka.actor.default-dispatcher-2] [akka://gekkoRemoting/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fpapagui%4010.210.51.121%3A51115-11763] Message [akka.remote.transport.ProtocolStateActor$HandleListenerRegistered] from Actor[akka://gekkoRemoting/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fpapagui%4010.210.51.121%3A51115-11763#726332147] to Actor[akka://gekkoRemoting/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fpapagui%4010.210.51.121%3A51115-11763#726332147] was not delivered. [6073908] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[DEBUG] [11/26/2013 22:05:36.743] [gekkoRemoting-akka.actor.default-dispatcher-83] [akka://gekkoRemoting/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fpapagui%4010.210.51.121%3A51115-64/endpointWriter] Associated [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556] -> [akka.tcp://pap...@10.210.51.121:51115]
[WARN] [11/26/2013 22:05:36.743] [gekkoRemoting-akka.actor.default-dispatcher-44] [Remoting] Tried to associate with unreachable remote address [akka.tcp://pap...@10.210.51.121:51115]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters.
[DEBUG] [11/26/2013 22:05:36.744] [gekkoRemoting-akka.actor.default-dispatcher-30] [akka://gekkoRemoting/system/remote-watcher] Watchee terminated: [akka.tcp://pap...@10.210.51.121:51115/user/canary/client]
[DEBUG] [11/26/2013 22:05:36.744] [gekkoRemoting-akka.actor.default-dispatcher-30] [akka://gekkoRemoting/system/remote-watcher] Unwatched last watchee of node: [akka.tcp://pap...@10.210.51.121:51115]
[INFO] [11/26/2013 22:05:36.744] [gekkoRemoting-akka.actor.default-dispatcher-12] [akka://gekkoRemoting/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fpapagui%4010.210.51.121%3A51115-11763] Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://gekkoRemoting/deadLetters] to Actor[akka://gekkoRemoting/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fpapagui%4010.210.51.121%3A51115-11763#726332147] was not delivered. [6073909] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[DEBUG] [11/26/2013 22:05:36.744] [gekkoRemoting-akka.actor.default-dispatcher-4] [akka://gekkoRemoting/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fpapagui%4010.210.51.121%3A51115-64/endpointWriter] Disassociated [akka.tcp://gekkoR...@loypcore4.intra.gsacapital.com:2556] -> [akka.tcp://pap...@10.210.51.121:51115]

Anyway, I guess this could have happened for a few reasons:

  1. Our network was swamped by the messages - they were not delivered in a timely manner. This is very unlikely - we have ample bandwidth and delays of 30s would be extreme
  2. Akka's TCP I/O layer was starved of CPU resources

I don't know much about the IO layer - I'm just using the standard TCP transport. Does it share a thread pool with the rest of the application? That is, could the large numbers of messages being sent around the system at this time meant that the server was unable to actually send the heartbeats out the door [1], or process incoming ones? Does the IO piece share an underlying thread pool with the messages flying round the "internal" system? 

What is your take on this? Any suggestions for settings that might help get around the issue very welcome!

Anyway - I'd really appreciate help with this - I'm in a lot better place than last week because the client smoothly reconnects and it's not happening more than one a day!

Chris

[1] - What does "Sending heartbeat" mean? Is this indicative that the heartbeat is on an IO queue? Or an actor's queue to be later passed to an IO queue?


akka-server-disconnect.log
akka-client-disconnect.log

Björn Antonsson

unread,
Nov 28, 2013, 5:20:44 AM11/28/13
to akka...@googlegroups.com
Hi Chris,

On Wednesday, 27 November 2013 at 14:41, oxbow_lakes wrote:

Following on from the discussion in this thread: https://groups.google.com/forum/#!topic/akka-user/RKf5pxGpXQU where I was asked for more detail on why a server and client were disconnecting from each other during a GC, I have made a few changes. Both server *and* client are running with these config properties:

    remote {
        log-remote-lifecycle-events = on
        gate-invalid-addresses-for = 5 s

        enabled-transports = ["akka.remote.netty.tcp"]
        netty.tcp {
            port = 0 (0 = client, 2556 server)
            maximum-frame-size = 100 MiB

Sending 100MiB messages can block other messages (like heartbeats) from arriving/being sent in a timely fashion.
If you starve the I/O layer, then you can configure its own dispatcher for it.\ with the setting akka.remote.use-dispatcher.

Anyway - I'd really appreciate help with this - I'm in a lot better place than last week because the client smoothly reconnects and it's not happening more than one a day!

Chris

[1] - What does "Sending heartbeat" mean? Is this indicative that the heartbeat is on an IO queue? Or an actor's queue to be later passed to an IO queue?


It means that the actor responsible for remote deathwatch has sent a heartbeat message with tell (it’s in some queue waiting to get sent over the wire).

If you need further assistance with application specific tuning or support, you can always contact Typesafe http://typesafe.com/how

B/


--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://akka.io/faq/
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/groups/opt_out.

Attachments:
- akka-server-disconnect.log
- akka-client-disconnect.log

-- 
Björn Antonsson
Typesafe – Reactive Apps on the JVM
twitter: @bantonsson

oxbow_lakes

unread,
Nov 28, 2013, 5:38:43 AM11/28/13
to akka...@googlegroups.com
Thanks - the very large messages are one-off "images" sent upon a client connect. They will be few and far between. Most messages are now around 200 bytes (I have a bespoke serializer).

I will tinker with the use-dispatcher mechanism! Thanks!

Chris

Ryan Tanner

unread,
Nov 29, 2013, 2:28:03 PM11/29/13
to akka...@googlegroups.com
I posted a few tips on our experiences with this problem on our blog: http://blog.goconspire.com/post/64412850756/akka-at-conspire-part-3-making-your-akka-life-easier

In our experience, the big key to stability was architecting your system in a way that makes it easy to reason about the amount of work done at any given time and therefore limit it, especially on anemic VPS instances.

Eric Pederson

unread,
Dec 2, 2013, 11:14:53 AM12/2/13
to akka...@googlegroups.com
Hi Chris -

I'm interested to hear if use-dispatcher helped for you.  We are also sending very large messages occasionally (using maximum-frame-size = 40m) and are experiencing similar issues.

Thanks,

Akka Team

unread,
Dec 4, 2013, 8:34:42 AM12/4/13
to Akka User List
Hi All,

Please note that Akka remoting uses a shared, sequenced channel (a TCP connection using the default transport) to transfer messages. This means that actors sending large messages delay messages sent by other actors (i.e no fairness). To solve this we would need chunking+subchannels support together with scheduling (WFQ or W2FQ). This might happen in the future (I want to implement it) but currently not officially planned. Until then you might want to implement chunking yourself for large messages.

-Endre
--
Akka Team
Typesafe - The software stack for applications that scale
Blog: letitcrash.com
Twitter: @akkateam

Eric Pederson

unread,
Dec 4, 2013, 11:54:58 AM12/4/13
to akka...@googlegroups.com
Hi Endre - 

Are you saying that large messages could interfere with system messages (like Remote heartbeating, etc) even if we set up a separate dispatcher?

Thanks,


-- Eric


You received this message because you are subscribed to a topic in the Google Groups "Akka User List" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/akka-user/WdWJOAVXQMo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to akka-user+...@googlegroups.com.

Akka Team

unread,
Dec 5, 2013, 5:32:26 AM12/5/13
to Akka User List
Hi Eric,


Are you saying that large messages could interfere with system messages (like Remote heartbeating, etc) even if we set up a separate dispatcher?

Yes, since everything has to go through a shared TCP connection they share the same bandwidth. We obviously cannot have larger throughput than the network, so the best we might be able to do in the future is chunking/scheduling/substreams which will obviously not increase bandwidth but it will reduce latency variance as substreams see the illusion of a dedicated share of the underlying bandwidth (or possibly more when other substreams are idle).

As for the special heartbeating of the transport failure detector (not the deathwatch) it is able to inject its heartbeat messages _into_  the stream at any point, since they do not need any ordering guarantees related to any other messages. Unfortunately everything else is actor messages where we need to keep the ordering guarantees. The only way to add more fairness is to split message streams that are unrelated in ordering (e.g. targeting different actors), handle them separately, chunk them up and then multiplex together according to a schedule. SCTP has support for scenario, but it is available only in Java7 and it is not supported on every platform. In other words we probably need to roll our own transport protocol.

-Endre

Eric Pederson

unread,
Dec 5, 2013, 12:42:15 PM12/5/13
to akka...@googlegroups.com
Hi Endre - Thanks for clarifying.


-- Eric
Reply all
Reply to author
Forward
0 new messages