client timeout errors when one node of broker cluster is down

gustavo...@gmail.com

Dec 2, 2020, 3:21:57 PM
to Choria Users
Hi,

We have a cluster of two network brokers, and today we shut down one node in order to test HA. The servers all connected to the surviving broker, but our clients began failing intermittently with the following error:

error 2020/12/02 19:16:24: natswrapper.rb:145:in block in start' Error in NATS connection: NATS::IO::SocketTimeoutError: NATS::IO::SocketTimeoutError error
2020/12/02 19:16:25: client.rb:39:in rescue in initialize' Timeout occured while trying to connect to middleware

The facts application failed to run, use -v for full error backtrace details: execution expired
warn 2020/12/02 19:16:25: natswrapper.rb:138:in `block in start' Disconnected from NATS: NATS::IO::SocketTimeoutError: NATS::IO::SocketTimeoutError 
 
Roughly one out of every two client runs works OK.

Is this to be expected when a broker node fails? Is there any way we can tell the client that one of the brokers is down?

From the servers' point of view everything is fine; they all connected to the surviving broker.

Thank you

Vincent Janelle

Dec 2, 2020, 4:04:11 PM
to choria...@googlegroups.com
Have you configured the client to use SRV records, or did you enumerate each broker in the configuration?

Gustavo Randich

Dec 2, 2020, 4:17:42 PM
to choria...@googlegroups.com
Are you referring to the brokers' or the servers' (agents') configuration? We use broker enumeration in all of them; we are not using SRV records.

We've modified /etc/choria/broker.conf on the surviving broker node (the plugin.choria.network.peers value) to remove the failed node and restarted the broker, to no effect: the client errors remain on about 50% of runs.
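
For reference, the change on the surviving node was along these lines (hostnames and ports here are placeholders, ours differ):

    # /etc/choria/broker.conf - before, both cluster peers listed
    plugin.choria.network.peers = nats://broker1.example.net:4223, nats://broker2.example.net:4223

    # after removing the failed node (followed by a broker restart)
    plugin.choria.network.peers = nats://broker1.example.net:4223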

Does the client use its own configuration file?

Thank you



R.I.Pienaar

Dec 2, 2020, 5:38:34 PM
to choria-users
What is happening is that the client is attempting to connect to the dead node, and really we have no choice but to time that request out after some period. A dead node looks like a far/slow node until the timeout occurs.

Not sure what would be better - if it's configured to communicate with 2 machines and it randomly tries one of the 2, there's not much I can do.

SRV records backed by some form of service discovery mechanism can help, but that's not exactly straightforward.
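
As a rough sketch of what that looks like (the service label and names below are examples only - the exact records Choria queries are in the choria.io SRV documentation), you publish one record per broker and withdraw the dead one during an outage:

    ; _service._proto.domain              TTL class type prio weight port target
    _mcollective-server._tcp.example.net. 300 IN    SRV  10   10     4222 broker1.example.net.
    _mcollective-server._tcp.example.net. 300 IN    SRV  10   10     4222 broker2.example.net. ; drop while broker2 is down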

Open to ideas for how to improve that.

Gustavo Randich

Dec 2, 2020, 5:56:11 PM
to choria...@googlegroups.com
Thanks R.I.,

Where can I tell the client to skip the dead node if I'm not using SRV records? Should I reconfigure all servers and the remaining broker (/etc/choria/*.conf)?

I'm thinking not only about the client's resilience when a node fails temporarily, but also about the permanent failure of a node; where should the cleanup be done?



R.I.Pienaar

Dec 3, 2020, 3:39:02 AM
to choria-users
This either defaults to "puppet:4222", reads mcollective_choria::config::middleware_hosts, or uses SRV.

So you can update your puppet code that generates the client configs if the outage is likely to last.
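
If you feed that parameter from Hiera the change is small - roughly this, assuming it takes an array of host:port strings (the hostname is a placeholder):

    # hiera data applied to client nodes
    mcollective_choria::config::middleware_hosts:
      - "broker1.example.net:4222"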

gustavo...@gmail.com

Dec 3, 2020, 9:23:15 AM
to Choria Users
OK, mcollective_choria::config::middleware_hosts ends up in /etc/puppetlabs/mcollective/plugin.d/choria.cfg on the client machine.

If the outage is transient, just removing the node there avoids the timeouts; the servers' configuration is no problem because they connect to healthy nodes only.
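
Concretely, once the dead node is dropped there, the generated file ends up listing only the healthy broker - roughly like this, assuming the module writes it as the plugin.choria.middleware_hosts setting (hostname is a placeholder):

    # /etc/puppetlabs/mcollective/plugin.d/choria.cfg
    plugin.choria.middleware_hosts = broker1.example.net:4222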

R.I. Pienaar

Dec 3, 2020, 12:19:02 PM
to choria...@googlegroups.com
Servers will also try a random one, so it might take a few tries to get it right - but the impact is less annoying since they are long-running connections.



---
R.I.Pienaar

Gustavo Randich

Dec 15, 2020, 8:21:59 PM
to choria...@googlegroups.com
It would be nice if the client did not try to connect to an unresponsive broker -- in our case the IP address does not even respond.
Servers, in my experience, reconnect to a healthy broker immediately, so they are not an issue.

