Hello,
since we began adding virtual machines to our Consul cluster, we frequently
see node failures followed almost immediately by recoveries.
Our watch reports:
{
"Node": "staticweb",
"CheckID": "serfHealth",
"Name": "Serf Health Status",
"Status": "critical",
"Notes": "",
"Output": "Agent not live or unreachable",
"ServiceID": "",
"ServiceName": ""
}
And we find plenty of these messages in the log:
May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] memberlist: Marking dynamicweb as failed, suspect timeout reached
May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] serf: EventMemberFailed: dynamicweb 136.243.52.235
May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] consul: member 'dynamicweb' failed, marking health critical
May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] agent.rpc: Accepted client: 127.0.0.1:56774
May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] serf: EventMemberJoin: dynamicweb 136.243.52.235
May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] consul: member 'dynamicweb' joined, marking health alive
May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] agent.rpc: Accepted client: 127.0.0.1:56778
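To put a number on how brief these outages are, a small script could pair each EventMemberFailed with the next EventMemberJoin for the same node. This is only a rough sketch: the function name flap_durations is made up for illustration, and the regex assumes the log line format shown above.

```python
import re
from datetime import datetime

# Assumption: serf member events look like the log lines quoted above.
EVENT_RE = re.compile(
    r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[INFO\] serf: "
    r"(EventMemberFailed|EventMemberJoin): (\S+)"
)

def flap_durations(lines):
    """Return (node, seconds) for each failure that is followed by a join of the same node."""
    failed_at = {}  # node name -> timestamp of its most recent failure
    flaps = []
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue
        timestamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        event, node = match.group(2), match.group(3)
        if event == "EventMemberFailed":
            failed_at[node] = timestamp
        elif node in failed_at:
            # A join after a recorded failure closes the outage window.
            flaps.append((node, (timestamp - failed_at.pop(node)).total_seconds()))
    return flaps
```

For the log excerpt above, this would report dynamicweb as unreachable for 13 seconds (20:15:56 to 20:16:09).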
I've seen the query_time parameter in the serf_lan section of consul info,
but I could not find a corresponding configuration option for the Consul agent in the online documentation. Is it possible to tune the time required before a node is identified as unreachable?
Cheers,
Nico