Frequent Agent not live or unreachable

Nico Schottelius

May 18, 2015, 3:49:04 PM
to consu...@googlegroups.com
Hello,

Since we began adding virtual machines to our Consul cluster, we frequently
see failures and recoveries in quick succession.

Our watch reports:

    {
        "Node": "staticweb",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "critical",
        "Notes": "",
        "Output": "Agent not live or unreachable",
        "ServiceID": "",
        "ServiceName": ""
    }

And we find plenty of these messages in the log:

    May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] memberlist: Marking dynamicweb as failed, suspect timeout reached
    May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] serf: EventMemberFailed: dynamicweb 136.243.52.235
    May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] consul: member 'dynamicweb' failed, marking health critical
    May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] agent.rpc: Accepted client: 127.0.0.1:56774
    May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] serf: EventMemberJoin: dynamicweb 136.243.52.235
    May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] consul: member 'dynamicweb' joined, marking health alive
    May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] agent.rpc: Accepted client: 127.0.0.1:56778

I've seen the query_time parameter in the serf_lan section of consul info,
but I found no corresponding configuration option for the consul agent on the web.
Is it possible to tune the time required to mark a node as unreachable?

Cheers,

Nico

Armon Dadgar

May 18, 2015, 6:02:19 PM
to consu...@googlegroups.com, Nico Schottelius
Hey Nico,

Those are not currently tunable, but they most likely will be in the 0.6 release.
However, I’m not sure tuning them will fix anything; it would more likely just mask an underlying routing issue.

Can you verify that you can route UDP traffic between all the peers? Almost 100% of the time,
a networking issue is at play.

Best Regards,
Armon Dadgar
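
One rough way to check UDP routing between two peers is a small echo probe, sketched below in Python. This is only an illustration under stated assumptions: the function names `serve_udp_echo` and `udp_probe` are made up for this example, and the probe has to use a spare test port because a running Consul agent already occupies its own gossip port (8301 by default for Serf LAN). A successful echo therefore shows that UDP can travel in both directions between the hosts, but not that a firewall lets the gossip port itself through.

```python
import socket
import threading  # handy for running the echo side in a background thread


def serve_udp_echo(sock):
    """Echo every datagram received on `sock` back to its sender."""
    while True:
        try:
            data, addr = sock.recvfrom(1024)
        except OSError:
            # The socket was closed or failed; stop serving.
            return
        sock.sendto(data, addr)


def udp_probe(host, port, timeout=1.0, payload=b"ping"):
    """Return True if `payload` is echoed back from host:port within `timeout`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            data, _ = s.recvfrom(1024)
        except OSError:
            # Covers timeouts as well as ICMP "port unreachable" errors.
            return False
        return data == payload
```

To use it, bind a UDP socket on one peer to a free test port (18301 is a hypothetical choice) and pass it to `serve_udp_echo`, then call `udp_probe("<peer address>", 18301)` from every other peer. Because UDP is lossy, a single False is not conclusive; repeated failures between one particular pair of nodes point at the routing between them.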

Nico Schottelius

May 19, 2015, 5:33:13 AM
to Armon Dadgar, consu...@googlegroups.com, Nico Schottelius
Good morning Armon,

There has indeed been a routing problem on one of the nodes. I've fixed
it and will watch the messages to see if that was the cause.

Thanks a lot for the pointer, Armon!

Cheers,

Nico
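
For watching the messages after such a fix, a small script can pair each `EventMemberFailed` with the next `EventMemberJoin` for the same node and report how long the node was considered down. The following Python sketch assumes the log format shown earlier in this thread; the function name `flap_durations` and the regular expression are illustrative and not part of Consul.

```python
import re
from datetime import datetime

# Matches the serf membership events as they appear in the log excerpt above.
EVENT = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[INFO\] serf: "
    r"(?P<event>EventMemberFailed|EventMemberJoin): (?P<node>\S+)"
)


def flap_durations(lines):
    """Yield (node, seconds) for each failure that is followed by a rejoin."""
    failed_at = {}
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S")
        node = m["node"]
        if m["event"] == "EventMemberFailed":
            failed_at[node] = ts
        elif node in failed_at:
            # A join after a recorded failure closes one flap.
            yield node, (ts - failed_at.pop(node)).total_seconds()


# Two lines taken from the log excerpt in the first message.
sample = [
    "May 18 20:15:56 entrance consul[1611]: 2015/05/18 20:15:56 [INFO] serf: EventMemberFailed: dynamicweb 136.243.52.235",
    "May 18 20:16:09 entrance consul[1611]: 2015/05/18 20:16:09 [INFO] serf: EventMemberJoin: dynamicweb 136.243.52.235",
]

for node, seconds in flap_durations(sample):
    print(f"{node} was marked failed for {seconds:.0f}s")  # dynamicweb was marked failed for 13s
```

If the routing fix was the cause, feeding the current log through this should produce no new entries for the affected nodes.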
