Frequent Agent not live or unreachable

Nico Schottelius

May 18, 2015, 3:49:04 PM

to consu...@googlegroups.com

Hello,

since we began adding virtual machines to our consul cluster, we frequently see failures and recoveries right after each other.

Our watch reports:

{
    "Node": "staticweb",
    "CheckID": "serfHealth",
    "Name": "Serf Health Status",
    "Status": "critical",
    "Notes": "",
    "Output": "Agent not live or unreachable",
    "ServiceID": "",
    "ServiceName": ""
}
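
As a rough illustration of how a watch handler could pick out reports like the one above, here is a minimal Python sketch. The function name `critical_serf_checks` is hypothetical, and the sketch assumes the handler receives either a single check object (as quoted above) or a JSON array of such objects.

```python
import json

def critical_serf_checks(payload):
    """Return the node names whose serfHealth check is critical.

    Assumption: `payload` is a JSON string holding either one check
    object or a list of check objects with the fields shown above.
    """
    checks = json.loads(payload)
    if isinstance(checks, dict):
        checks = [checks]
    return [
        c["Node"]
        for c in checks
        if c.get("CheckID") == "serfHealth" and c.get("Status") == "critical"
    ]

# Sample payload copied from the watch output quoted above.
SAMPLE = """{
    "Node": "staticweb",
    "CheckID": "serfHealth",
    "Name": "Serf Health Status",
    "Status": "critical",
    "Notes": "",
    "Output": "Agent not live or unreachable",
    "ServiceID": "",
    "ServiceName": ""
}"""

print(critical_serf_checks(SAMPLE))
```

In a real handler the payload would typically be read from standard input instead of a hard-coded sample.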

And we find plenty of these messages in the log:

May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] memberlist: Marking dynamicweb as failed, suspect timeout reached

May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] serf: EventMemberFailed: dynamicweb 136.243.52.235

May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] consul: member 'dynamicweb' failed, marking health critical

May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] agent.rpc: Accepted client: 127.0.0.1:56774

May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] serf: EventMemberJoin: dynamicweb 136.243.52.235

May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] consul: member 'dynamicweb' joined, marking health alive

May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] agent.rpc: Accepted client: 127.0.0.1:56778
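
To quantify how quickly nodes flap, the log lines above can be paired up programmatically. The following is a minimal sketch, assuming the log format shown above; the names `find_flaps` and `max_gap_seconds` are made up for illustration and are not part of consul.

```python
import re
from datetime import datetime

# Matches the serf membership events in the log format quoted above.
EVENT_RE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[INFO\] serf: "
    r"(?P<event>EventMemberFailed|EventMemberJoin): (?P<node>\S+) (?P<addr>\S+)"
)

def find_flaps(lines, max_gap_seconds=60):
    """Return (node, seconds_down) for each failure followed by a quick rejoin."""
    failed_at = {}  # node name -> time of the most recent failure event
    flaps = []
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y/%m/%d %H:%M:%S")
        node = m.group("node")
        if m.group("event") == "EventMemberFailed":
            failed_at[node] = ts
        elif node in failed_at:
            gap = (ts - failed_at.pop(node)).total_seconds()
            if gap <= max_gap_seconds:
                flaps.append((node, gap))
    return flaps

# The two serf events from the log excerpt above.
SAMPLE_LOG = [
    "May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] serf: EventMemberFailed: dynamicweb 136.243.52.235",
    "May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] serf: EventMemberJoin: dynamicweb 136.243.52.235",
]

for node, seconds in find_flaps(SAMPLE_LOG):
    print(f"{node} was marked failed for {seconds:.0f}s")
```

For the excerpt above this reports a 13-second gap between the failure and the rejoin of dynamicweb.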

I've seen the query_time parameter in the serf_lan section of consul info,

however, I could not find a corresponding configuration option for the consul agent on the web. Is it possible to tune the time that is required to identify a node as being unreachable?

Cheers,

Nico

Armon Dadgar

May 18, 2015, 6:02:19 PM

to consu...@googlegroups.com, Nico Schottelius

Hey Nico,

Those are not currently tunable, but they most likely will be in the 0.6 release.

However, I’m not sure tuning them would fix anything; it would more likely just mask an underlying routing issue.

Can you verify that you can route UDP traffic between all the peers? Almost 100% of the time,

this is a networking issue.
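
One way to approximate such a UDP check is a small echo responder on one node and a probe on another. The sketch below is not a consul feature; `run_echo_once` and `udp_probe` are hypothetical helpers. Consul's LAN gossip uses port 8301 (TCP and UDP) by default, but a running agent already holds that port, so this assumes a spare UDP port (or a stopped agent). A firewall rule specific to 8301 would not be caught by a test on a different port.

```python
import socket

def run_echo_once(sock):
    """Receive a single datagram on a bound UDP socket and send it back."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)

def udp_probe(host, port, timeout=1.0, payload=b"ping"):
    """Return True if a UDP echo responder at host:port answers in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            data, _ = s.recvfrom(1024)
        except OSError:
            # Covers timeouts as well as ICMP "port unreachable" errors.
            return False
        return data == payload
```

On the responder node, bind a UDP socket to the chosen port and call `run_echo_once` in a loop; on each peer, call `udp_probe` with the responder's address. Repeating this in both directions between every pair of nodes gives a rough picture of whether UDP is routed symmetrically.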

Best Regards,

Armon Dadgar


Nico Schottelius

May 19, 2015, 5:33:13 AM

to Armon Dadgar, consu...@googlegroups.com, Nico Schottelius

Good morning Armon,

there was indeed a routing problem on one of the nodes. I've fixed
it and will watch the messages to see if that was the cause.

Thanks a lot for the pointer, Armon!

Cheers,

Nico

--
Visit Silicon Valley of Switzerland: Digital.Glarus - http://digital.glarus.ungleich.ch - @ungleich
