SIGINT (Ctrl-C) or SIGTERM (Ctrl-Break) as the *actually* "graceful" shutdown on Windows


Brad Aisa

May 18, 2015, 7:47:42 PM
to consu...@googlegroups.com
I am working on a Windows installer for running consul.exe as a service on Windows. We have been using nssm.exe as a service control manager for Unix-style executables.

One issue is that the engineer who investigated Consul has been adamant that we shut down consul with SIGTERM (Ctrl-Break) and *not* SIGINT (Ctrl-C). He was very clear that, despite the Consul documentation saying that Ctrl-C is the "graceful" way to shut it down, it causes major issues because the node leaves the cluster, which can leave the cluster without a quorum and in a state that is unrecoverable without manual intervention. Our clusters will have only a few nodes each, typically not even ten. It is critical that our deployments using a small number of nodes work properly.

Can someone concur or contradict his claim? A challenge for me is that nssm can send a Ctrl-C to shut down a proxied app, but it has no option to send Ctrl-Break.

Alternatively, does anyone know if it might be possible to add a command-line mode switch to consul so it would handle SIGINT in the SIGTERM way instead?

Thanks for any tips!

Armon Dadgar

May 19, 2015, 2:13:57 PM
to consu...@googlegroups.com, Brad Aisa
Hey Brad,

Let me clarify the difference in behavior. When an agent exits, it can do so “gracefully” or not.
A graceful leave is done by broadcasting an intent to leave prior to exiting. The difference is when
the failure detector eventually picks up that the node is gone. If the node registered the intent to leave
it enters the “left” state, otherwise it enters the “failed” state.

A node in the failed state continues to be considered part of the cluster, but is just unreachable.
All its services and checks remain, and the cluster attempts to re-establish contact with the node. This
is done because it’s impossible to distinguish between a network partition, an agent crash, a system that is
starved of CPU, and so on. Based on the absence of a signal, it’s provably impossible to tell what has happened.

A node in the left state however, has already told us it intends to leave. This means it is removed from the
Raft replication group, its services and checks are deregistered, and it appears to no longer be part of the cluster.
No attempt is made to establish communication with it (it did leave after all).
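
To make the left/failed distinction concrete, here is a minimal sketch in Python. It is only an illustrative model of the behavior described above, not Consul code; the names `NodeState`, `state_after_exit`, and `cluster_view` are hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    """States named in this thread (hypothetical model, not Consul's types)."""
    LEFT = "left"
    FAILED = "failed"


def state_after_exit(announced_leave: bool) -> NodeState:
    """State the failure detector assigns once it notices the node is gone.

    A node that broadcast its intent to leave ends up "left";
    a node that simply disappeared ends up "failed".
    """
    return NodeState.LEFT if announced_leave else NodeState.FAILED


def cluster_view(state: NodeState) -> dict:
    """How the rest of the cluster treats a node in the given state."""
    if state is NodeState.LEFT:
        # The node said it was leaving: it is dropped and not contacted again.
        return {
            "counted_as_member": False,
            "services_and_checks_kept": False,
            "reconnect_attempted": False,
        }
    # The node is only unreachable: it stays a member and the cluster keeps
    # trying to re-establish contact with it.
    return {
        "counted_as_member": True,
        "services_and_checks_kept": True,
        "reconnect_attempted": True,
    }


graceful = cluster_view(state_after_exit(announced_leave=True))
crashed = cluster_view(state_after_exit(announced_leave=False))
```

In this model, a graceful leave and a crash differ only in the flag passed to `state_after_exit`, but every downstream consequence flips with it.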

It’s very important to understand this distinction operationally. If you have 3 servers, and 2 or more of them leave,
you will trigger a loss of quorum / outage. If you do a rolling leave / join of all the servers (new version, config change, etc),
then the system will handle this fine. If you take a hard power loss and all servers die (NOT gracefully, they are just failed),
then when they start up again, they will re-establish quorum.
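
The quorum arithmetic behind the power-loss case can be sketched as follows. This assumes the usual majority rule for Raft (more than half of the servers in the replication group), which is not spelled out in this thread; `quorum_size` and `has_quorum` are hypothetical helper names, not part of Consul.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of a replication group with `servers` members."""
    return servers // 2 + 1


def has_quorum(servers: int, reachable: int) -> bool:
    """True if enough members of the group are reachable to form a majority."""
    return reachable >= quorum_size(servers)


# Hard power loss with 3 servers: all three are "failed", so they still
# count as members of the group while they are down.
during_outage = has_quorum(servers=3, reachable=0)

# Once at least two of the three have restarted, a majority exists again.
after_restart = has_quorum(servers=3, reachable=2)
```

The point of the sketch is that failed servers stay in the group, so the majority is still computed over all three, and bringing them back is enough to recover without manual intervention.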

I hope that helps to clarify the situation. There is nothing disastrous about a node leaving, you just have to understand
the operational impact that it has.

Best Regards,
Armon Dadgar