Faster Java client recovery?


vikinghawk

Sep 26, 2017, 10:13:35 PM
to rabbitmq-users
While investigating some issues with client recovery taking longer than expected (due to heavy load on the server), I noticed the Java client sleeps for networkRecoveryInterval millis right away. It then attempts to connect and, if that fails, sleeps for networkRecoveryInterval millis again. My question is: why the initial sleep?
synchronized private void beginAutomaticRecovery() throws InterruptedException {
        // Initial wait, before the first reconnection attempt
        Thread.sleep(this.params.getNetworkRecoveryInterval());

        this.notifyRecoveryListenersStarted();

        while (!connected) {
            // ... attempt reconnection and sleep after a failure
        }
}

In the case where a node in the cluster is brought down (rather than network loss), it would be better to remove the 5 second (default) wait and attempt to reconnect immediately. This would allow clients to fail over to other nodes with less downtime.

I can submit a PR for the change, but wanted to check in here first to make sure this idea was sound and the sleep wasn't there for a specific reason.
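
To make the difference concrete, here is a minimal sketch of the two behaviours. It is a toy model, not the client's actual code: `RecoverySketch`, `timeToRecover`, and the `tryConnect` callback are made-up names, and delays are only added up rather than slept.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical model of a recovery loop; not the Java client's real implementation. */
final class RecoverySketch {

    /**
     * Returns the total number of milliseconds the loop would have slept
     * before a connection attempt succeeds. Nothing actually sleeps here.
     */
    static long timeToRecover(long intervalMillis, boolean sleepFirst, BooleanSupplier tryConnect) {
        long waited = 0;
        if (sleepFirst) {
            waited += intervalMillis; // the initial sleep described in the post
        }
        while (!tryConnect.getAsBoolean()) {
            waited += intervalMillis; // sleep only after a failed attempt
        }
        return waited;
    }

    public static void main(String[] args) {
        // Assume another cluster node is reachable right away,
        // so the very first attempt succeeds.
        System.out.println(timeToRecover(5000, true, () -> true));  // prints 5000
        System.out.println(timeToRecover(5000, false, () -> true)); // prints 0
    }
}
```

With a healthy node available, the initial sleep is the entire recovery delay in this model; without it, the client would fail over immediately.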

Michael Klishin

Sep 27, 2017, 12:48:22 AM
to rabbitm...@googlegroups.com
Because even though whatever could have interrupted the connection is almost certainly transient, it is
rare to see such issues go away in milliseconds. For the same reason we do not do retrying in a right loop.

You can greatly reduce the interval from the default one of 5s and if it works for you, by all means stick with it.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Sep 27, 2017, 12:48:56 AM
to rabbitm...@googlegroups.com
This should read: "we do not do retrying in a tight loop".

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

vikinghawk

Sep 27, 2017, 1:13:06 AM
to rabbitmq-users
I agree most network interruptions will not go away in milliseconds. My goal was just faster failover to other nodes in the cluster when the connection was lost because a node was intentionally brought down for upgrades or maintenance. Lowering the 5 second interval would help with that, but it also means a potentially tighter loop during actual network outages. It would depend on the type of outage, though: in several scenarios socket/connect timeouts would also be at play and would keep us out of a tight loop.

Another option would be to implement backoff logic into the retry.
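
As a rough illustration of what backoff could look like, here is a minimal sketch of a capped exponential delay: the first retry is quick, and later ones slow down until they hit a ceiling. `BackoffSketch`, `delayMillis`, and the base/max parameters are hypothetical names chosen for this example, not anything in the client today.

```java
/** Hypothetical capped exponential backoff; names and values are illustrative only. */
final class BackoffSketch {

    /**
     * Delay before the given retry: baseMillis doubled once per previous attempt,
     * never exceeding maxMillis.
     *
     * @param attempt zero-based retry counter
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis < 0 || maxMillis < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        long delay = baseMillis;
        // Stop doubling once the cap is reached, which also avoids overflow.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        // With a 100 ms base and the current 5 s default as the cap,
        // the delays grow as 100, 200, 400, ... and then stay at 5000.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println(delayMillis(attempt, 100, 5000));
        }
    }
}
```

In this sketch a planned node shutdown costs about 100 ms before the first reconnect attempt, while a longer outage settles into the same 5 second cadence as today.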


Michael Klishin

Sep 27, 2017, 2:19:33 AM
to rabbitm...@googlegroups.com
Exponential back-off might be a decent middle ground. My experience with it, including with RabbitMQ
(remember background GC of idle processes?), is that it is really difficult to reason about it
once you see its effect on metrics charts, in logs, etc. "This happens every … wait, no longer happens… wait, just happened again…"

What we can do is extract that part to be pluggable and default to what we do today, but
give those who find exponential back-off to be worth the headaches a chance to use it.

Feel free to look into a pull request :)
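
One possible shape for the pluggable part, as a sketch only: a small strategy interface with a fixed-interval default (what happens today) and an alternative that walks through a list of delays. `RecoveryDelay`, `FixedDelay`, and `SequenceDelay` are hypothetical names for illustration; the real design would be settled in the pull request.

```java
/** Hypothetical strategy interface; not an existing type in the client. */
interface RecoveryDelay {
    /**
     * @param attempt zero-based count of recovery attempts made so far
     * @return how long to wait, in milliseconds, before the next attempt
     */
    long delayMillis(int attempt);
}

/** Default behaviour in this sketch: the same interval before every attempt. */
final class FixedDelay implements RecoveryDelay {
    private final long intervalMillis;

    FixedDelay(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    @Override
    public long delayMillis(int attempt) {
        return intervalMillis;
    }
}

/** Opt-in alternative: steps through a list of delays and then stays on the last one. */
final class SequenceDelay implements RecoveryDelay {
    private final long[] steps;

    SequenceDelay(long... steps) {
        if (steps.length == 0) {
            throw new IllegalArgumentException("at least one delay is required");
        }
        this.steps = steps.clone();
    }

    @Override
    public long delayMillis(int attempt) {
        // Clamp the index so out-of-range attempts use the first or last delay.
        int index = Math.max(0, Math.min(attempt, steps.length - 1));
        return steps[index];
    }
}
```

A recovery loop would then ask the configured strategy for the delay before each attempt, for example `new FixedDelay(5000)` to keep the current behaviour or `new SequenceDelay(0, 1000, 2000, 5000)` for an immediate first attempt followed by growing waits.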


Arnaud Cogoluègnes

Sep 27, 2017, 9:15:59 AM
to rabbitm...@googlegroups.com
Created a GitHub issue: https://github.com/rabbitmq/rabbitmq-java-client/issues/308

Note that it could go into 4.3.0, so the PR should be submitted against the 4.3.x-stable branch.

vikinghawk

Sep 27, 2017, 4:53:18 PM
to rabbitmq-users