How to detect if a service is down due to a server crash

wolfgang

Jan 28, 2016, 11:12:25 PM
to Consul
Hi there,

Here is my situation.
I have two servers.
A Consul agent is running on each server, and the agents communicate with each other.

A service is running on the first server.
A health check is correctly set up with the Consul agent, so I am notified if the service itself goes down (the Consul agent on the second server is watching the service's state).

But what if the first server crashes?
Then not only the service but also the Consul agent is down, so of course the health check won't run.
I guess the service will be left in a healthy state in the service list.

How can I detect from the second server that the service is down (critical)?

Thanks in advance
wolfgang




Armon Dadgar

Jan 29, 2016, 9:51:46 PM
to consu...@googlegroups.com, wolfgang
Wolfgang,

We recommend running at least 3 Consul servers. In a 2 node deployment,
if either server fails the cluster is unavailable. Consul requires a majority of the servers
to be available, so a 3 node cluster can tolerate a single failure, while a 5 node cluster can
tolerate 2 node failures.
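
The majority rule can be checked with a little arithmetic. This is only a sketch of the calculation; the function names are made up for illustration and are not part of Consul.

```python
def quorum(servers: int) -> int:
    """Smallest number of servers that forms a majority."""
    return servers // 2 + 1


def failures_tolerated(servers: int) -> int:
    """How many servers can fail while a majority is still available."""
    return servers - quorum(servers)


for n in (2, 3, 5):
    print(f"{n} servers: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

A 2 node cluster needs both servers for a majority, which is why it cannot survive the loss of either one.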

The agents themselves (client and server) participate in a background gossip protocol
that detects machine level failure. So if the agent or machine crashes, the entire node
is marked as having failed, including all the services running on that node. In the catalog
this shows up as the “serfHealth” check, which is a special built-in check.
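
As a rough illustration, here is how a client could look for that check in the list of checks for one node. The sample data is hand-written to resemble the JSON from the node health endpoint (`/v1/health/node/<node>`). The node name, service name and helper function are hypothetical, and only the `CheckID` and `Status` fields are relied on.

```python
from typing import Dict, List, Optional

# Hand-written sample, assumed to resemble what the health API returns
# for a node whose machine has crashed.
sample_checks: List[Dict[str, str]] = [
    {"Node": "server-1", "CheckID": "serfHealth", "Status": "critical"},
    {"Node": "server-1", "CheckID": "service:web", "Status": "passing"},
]


def serf_status(checks: List[Dict[str, str]]) -> Optional[str]:
    """Return the status of the node-level serfHealth check, if present."""
    for check in checks:
        if check["CheckID"] == "serfHealth":
            return check["Status"]
    return None


status = serf_status(sample_checks)
print("node considered alive:", status == "passing")
```

In this sample the service check still says "passing" because the agent that ran it is gone; the serfHealth entry is what reveals the crash.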

Hope that helps!

Best Regards,
Armon Dadgar
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/f56d4b5e-664f-418c-87db-f6b81615c938%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

veerbhan tahlani

Jul 20, 2017, 10:06:54 AM
to Consul, wolfg...@gmail.com
Hi Armon,

If we have a 3-server (A, B and C) Consul cluster and server A is down, would the services registered with server A be unavailable when I query the health endpoint? If so, why? Consul servers replicate details to each other, right?

What information do Consul servers replicate between themselves?

Armon Dadgar

Jul 20, 2017, 6:51:21 PM
to consu...@googlegroups.com, veerbhan tahlani, wolfg...@gmail.com
The Consul servers replicate all their state between themselves. They don’t shard the data, so if you lose server A, servers B/C have the full catalog of all the services. The Consul architecture overview might be helpful (https://www.consul.io/docs/internals/architecture.html).

Best Regards,
Armon Dadgar

veerbhan tahlani

Jul 20, 2017, 10:28:35 PM
to Armon Dadgar, wolfg...@gmail.com, consu...@googlegroups.com
If I use the catalog to get service details, then I can't be sure whether the service is healthy; the service might actually be down and my request would end up failing.

Is there any way to replicate health checks between servers? Or to shift data from a server that is down to the servers that are up when any server goes down?


James Phillips

Jul 21, 2017, 12:26:47 PM
to consu...@googlegroups.com, Armon Dadgar, wolfg...@gmail.com
Hi Veerbhan,

Please do not cross-post questions. I see similar questions now on
about 5 GitHub issues and that makes it difficult to answer in one
place, and dilutes the ability of the community to find answers later.

The servers do replicate the health check information across all the
servers, so if the leader Consul server goes down then the remaining
Consul servers will elect a new leader and take over, and will have a
complete set of state to work with. There's nothing you need to do as
an operator to shift any data to the other Consul servers.

From your other posts it sounded like you were concerned about whether,
if the serfHealth check fails on a Consul client, you'd lose all the
services on that client's node. In general, yes, the serfHealth check
is implicitly AND-ed with every service check on that node. We've
worked hard to ensure that the serfHealth check doesn't give false
positives (https://www.hashicorp.com/blog/making-gossip-more-robust-with-lifeguard/).

If you are using Consul's DNS interface then yes, Consul will no
longer include any service instances from that failed node in DNS
requests. Usually things connecting to that node will start to
experience errors, will make another DNS request, and then will get a
different, healthy instance of the service because the failed node
isn't included in the results because of the serfHealth check being
failed. If you are using DNS and somehow the serfHealth check failed
incorrectly it shouldn't harm any existing connections to your
services, but it would prevent new connections over that period.
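
The retry pattern described above can be sketched as follows. This is a simulation only: `fake_resolve` and `fake_connect` are hypothetical stand-ins for a DNS lookup against Consul and a real connection attempt, and the addresses are made up. The point is just that the client resolves again after an error instead of reusing the old answer.

```python
from typing import Callable, List, Optional


def call_with_reresolve(resolve: Callable[[], List[str]],
                        connect: Callable[[str], str],
                        attempts: int = 3) -> str:
    """Resolve, try the first address, and resolve again after a failure."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        addresses = resolve()
        if not addresses:
            break
        try:
            return connect(addresses[0])
        except ConnectionError as exc:
            last_error = exc
    raise ConnectionError("no reachable instance") from last_error


# Simulated world: 10.0.0.1 has crashed. Once the failure is detected,
# the resolver stops returning that address.
state = {"node_marked_failed": False}


def fake_resolve() -> List[str]:
    if state["node_marked_failed"]:
        return ["10.0.0.2"]
    return ["10.0.0.1", "10.0.0.2"]


def fake_connect(address: str) -> str:
    if address == "10.0.0.1":
        state["node_marked_failed"] = True  # pretend gossip has now noticed
        raise ConnectionError(address)
    return "connected to " + address


result = call_with_reresolve(fake_resolve, fake_connect)
print(result)
```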

If you are using the Consul APIs directly like
https://www.consul.io/api/health.html#list-nodes-for-service you will
get back that the serfHealth check has failed and it's up to you to
interpret that in your application.
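
One way an application might interpret that response is to keep only instances where every check, including serfHealth, is passing. The sketch below uses hand-written sample data assumed to resemble the JSON from the health endpoint for a service; the node names, addresses and service IDs are hypothetical.

```python
from typing import Any, Dict, List

# Hand-written sample: node-a has crashed, so its serfHealth check is
# critical while its service check still shows the last reported state.
sample_entries: List[Dict[str, Any]] = [
    {
        "Node": {"Node": "node-a", "Address": "10.0.0.1"},
        "Service": {"ID": "web-1", "Service": "web", "Port": 8080},
        "Checks": [
            {"CheckID": "serfHealth", "Status": "critical"},
            {"CheckID": "service:web-1", "Status": "passing"},
        ],
    },
    {
        "Node": {"Node": "node-b", "Address": "10.0.0.2"},
        "Service": {"ID": "web-2", "Service": "web", "Port": 8080},
        "Checks": [
            {"CheckID": "serfHealth", "Status": "passing"},
            {"CheckID": "service:web-2", "Status": "passing"},
        ],
    },
]


def passing_instances(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep entries whose node-level and service-level checks all pass."""
    return [
        entry for entry in entries
        if all(check["Status"] == "passing" for check in entry["Checks"])
    ]


usable = passing_instances(sample_entries)
print([entry["Node"]["Node"] for entry in usable])
```

The same endpoint also accepts a `passing` query parameter that applies this kind of filtering on the Consul side, which may be simpler than filtering in the client.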

> Can you please suggest, how can I get services which are healthy even if their node is down/failing?

Consul's not set up to do that out of the box, at least if you are
using the DNS interface. If you wanted to ignore the serfHealth check,
which is definitely not recommended, then you could use the health API
and filter it out there. It's a pretty deep part of Consul and
fundamental to correct operation of Consul's edge-triggered health
check update model as well as proper operation of locks and sessions,
so I don't think we'd be able to provide a way to disable the
serfHealth check. This talk explains why we have the serfHealth check
in lots of detail - https://www.youtube.com/watch?v=CDQaqiRhTtk.
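
For completeness, filtering the serfHealth check out of a health API response could look roughly like the sketch below. Again, this is not recommended: when a node has crashed, its service checks only show the last state the agent reported. The sample data and names are hypothetical.

```python
from typing import Any, Dict, List

# Hand-written sample assumed to resemble a health API response.
entries: List[Dict[str, Any]] = [
    {
        "Node": {"Node": "node-a"},
        "Checks": [
            {"CheckID": "serfHealth", "Status": "critical"},
            {"CheckID": "service:web-1", "Status": "passing"},
        ],
    },
    {
        "Node": {"Node": "node-b"},
        "Checks": [
            {"CheckID": "serfHealth", "Status": "passing"},
            {"CheckID": "service:web-2", "Status": "critical"},
        ],
    },
]


def passing_ignoring_serf(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep entries whose checks pass once serfHealth is excluded.

    Warning: for a crashed node the remaining checks may be stale.
    """
    kept = []
    for item in items:
        others = [c for c in item["Checks"] if c["CheckID"] != "serfHealth"]
        if all(c["Status"] == "passing" for c in others):
            kept.append(item)
    return kept


selected = passing_ignoring_serf(entries)
print([item["Node"]["Node"] for item in selected])
```

Here node-a is selected even though its machine is marked as failed, which shows the risk of ignoring the node-level check.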

Hope that helps!

-- James

veerbhan tahlani

Jul 24, 2017, 9:04:53 AM
to Consul, armon....@gmail.com, wolfg...@gmail.com
Hi James,

If I use "health/service" I do get a result, but I am not sure whether the services are still healthy, as the services' health is not updated if the node on which they are registered goes down.

Could all servers not perform health checks for all the services, so that if any server goes down we would still know whether a service is healthy, since Consul could use quorum here?


Thanks
Veerbhan