client timeout errors when one node of broker cluster is down

gustavo...@gmail.com

Dec 2, 2020, 3:21:57 PM
to Choria Users
Hi,

We have a cluster of two network brokers, and today we shut down one node in order to test HA. The servers all connected to the surviving broker, but our clients began failing intermittently with the following error:

error 2020/12/02 19:16:24: natswrapper.rb:145:in block in start' Error in NATS connection: NATS::IO::SocketTimeoutError: NATS::IO::SocketTimeoutError error
2020/12/02 19:16:25: client.rb:39:in rescue in initialize' Timeout occured while trying to connect to middleware

The facts application failed to run, use -v for full error backtrace details: execution expired
warn 2020/12/02 19:16:25: natswrapper.rb:138:in `block in start' Disconnected from NATS: NATS::IO::SocketTimeoutError: NATS::IO::SocketTimeoutError 
 
Roughly one out of every two client runs works OK.

Is this to be expected when a broker node fails? Is there any way we can tell the client that one of the brokers is down?

From the servers' point of view everything is fine; they all connected to the surviving broker.

Thank you

Vincent Janelle

Dec 2, 2020, 4:04:11 PM
to choria...@googlegroups.com
Have you configured the client to use SRV records, or did you enumerate each broker in the configuration?

Gustavo Randich

Dec 2, 2020, 4:17:42 PM
to choria...@googlegroups.com
Are you referring to the brokers' or the servers' (agents') configuration? We use broker enumeration in all of them; we are not using SRV records.

We've modified /etc/choria/broker.conf on the surviving broker node (the plugin.choria.network.peers value) to remove the failed node and restarted the broker, to no effect: the client errors remain on about 50% of runs.
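
For reference, the change on the surviving node was along these lines (hostnames and ports here are placeholders, ours differ):

    # /etc/choria/broker.conf - before, both cluster peers listed
    plugin.choria.network.peers = nats://broker1.example.net:4223, nats://broker2.example.net:4223

    # after removing the failed node (followed by a broker restart)
    plugin.choria.network.peers = nats://broker1.example.net:4223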

Does the client use its own configuration file?

Thank you



R.I.Pienaar

Dec 2, 2020, 5:38:34 PM
to choria-users
What is happening is that the client is attempting to connect to the dead node, and really we have no choice but to time that request out after some period. A dead node looks like a far/slow node until the timeout occurs.

Not sure what would be better - if it's configured to communicate with 2 machines and it randomly tries one of the 2, there's not much I can do.

SRV records backed by some form of service discovery mechanism can help, but that's not exactly straightforward.
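
As a rough sketch of what that looks like (the service label and names below are examples only - the exact records Choria queries are in the choria.io SRV documentation), you publish one record per broker and withdraw the dead one during an outage:

    ; _service._proto.domain              TTL class type prio weight port target
    _mcollective-server._tcp.example.net. 300 IN    SRV  10   10     4222 broker1.example.net.
    _mcollective-server._tcp.example.net. 300 IN    SRV  10   10     4222 broker2.example.net. ; drop while broker2 is down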

Open to ideas for how to improve that.

Gustavo Randich

Dec 2, 2020, 5:56:11 PM
to choria...@googlegroups.com
Thanks R.I.,

Where can I tell the client to skip the dead node if I'm not using SRV records? Should I reconfigure all servers and the remaining broker (/etc/choria/*.conf)?

I'm thinking not only about the client's resilience when a node fails temporarily, but also about the permanent failure of a node; where should the cleanup be done?



R.I.Pienaar

Dec 3, 2020, 3:39:02 AM
to choria-users
This either defaults to "puppet:4222", reads mcollective_choria::config::middleware_hosts, or uses SRV.

So you can update your puppet code that generates the client configs if the outage is likely to last.
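
If you feed that parameter from Hiera the change is small - roughly this, assuming it takes an array of host:port strings (the hostname is a placeholder):

    # hiera data applied to client nodes
    mcollective_choria::config::middleware_hosts:
      - "broker1.example.net:4222"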

gustavo...@gmail.com

Dec 3, 2020, 9:23:15 AM
to Choria Users
OK, mcollective_choria::config::middleware_hosts ends up in /etc/puppetlabs/mcollective/plugin.d/choria.cfg on the client machine.

If the outage is transient, just removing the node there avoids the timeouts; the servers' configuration is no problem because they connect to healthy nodes only.
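
Concretely, once the dead node is dropped there, the generated file ends up listing only the healthy broker - roughly like this, assuming the module writes it as the plugin.choria.middleware_hosts setting (hostname is a placeholder):

    # /etc/puppetlabs/mcollective/plugin.d/choria.cfg
    plugin.choria.middleware_hosts = broker1.example.net:4222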

R.I. Pienaar

Dec 3, 2020, 12:19:02 PM
to choria...@googlegroups.com
Servers will also try a random one, so it might take a few tries to get it right - but the impact is less annoying since they are long-running connections.



---
R.I.Pienaar

Gustavo Randich

Dec 15, 2020, 8:21:59 PM
to choria...@googlegroups.com
It would be nice if the client did not try to connect to an unresponsive broker -- in our case the IP address does not even respond.
Servers, in my experience, reconnect to a healthy broker immediately, so they are not an issue.

