We have a situation where one of the Galera Cluster VMs lost its storage because the network link to iSCSI was broken. MySQL connections started backing up because the node couldn't write to disk, and eventually the cluster went into the "INITIALIZED" state (i.e. down). It is possible that one of the other DB admins helped propagate the problem, because he started trying to shut down the systems that he perceived as stalled (but he couldn't log in to the node that lost its storage). We do not have any "STONITH" (Shoot The Other Node In The Head) capability set up (I'm not even sure whether Galera handles that).
So, I have one node in a five-node cluster that is "running" but has no storage and is effectively "hung." I cannot log in and shut it down, short of doing a "power off" operation in VMware, which is what we ended up having to do to recover.
What would be the proper course of action to get the cluster to "survive" with minimal downtime?
Is there some command I can issue to cause the other nodes to "blacklist" (?) the bad node? Is there some configuration that could handle that automatically (not sure, that seems a bit dangerous)?
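For context, here is a minimal sketch of how the state of each surviving node could be summarized while this is happening. It assumes the tab-separated output of `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"` is available from each reachable node. The function names and the three labels it returns are my own invention, not anything Galera provides. It only reads the standard `wsrep_cluster_status`, `wsrep_local_state_comment`, and `wsrep_ready` status variables, and it does nothing to evict or fence the bad node.

```python
from typing import Dict


def parse_wsrep_status(raw: str) -> Dict[str, str]:
    """Parse tab-separated 'name<TAB>value' lines into a dict.

    Lines that are not exactly two tab-separated fields, or whose name
    does not start with 'wsrep_', are ignored.
    """
    status: Dict[str, str] = {}
    for line in raw.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue
        name, value = parts[0].strip(), parts[1].strip()
        if name.startswith("wsrep_"):
            status[name] = value
    return status


def describe_node(status: Dict[str, str]) -> str:
    """Return a coarse, hypothetical label for one node's view of the cluster."""
    primary = status.get("wsrep_cluster_status") == "Primary"
    synced = status.get("wsrep_local_state_comment") == "Synced"
    ready = status.get("wsrep_ready") == "ON"

    if not primary:
        # This node is not part of a primary component (no quorum from its view).
        return "non-primary"
    if synced and ready:
        return "ok"
    # In the primary component, but joining, donating, or otherwise not ready.
    return "primary-but-not-synced"
```

A node reporting `non-Primary` / `Initialized`, as ours did, would come back as "non-primary" here. A hung node that cannot be queried at all produces no output to parse, which is exactly the case I am asking about.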
Has anyone else experienced a similar situation and come up with a solution?
Environment:
* VMware 5.5.x (multiple hosts, with affinity rules set up to disallow multiple Galera cluster nodes from occupying the same host server)
* Nimble Storage (iSCSI)
* CentOS 6.7 - latest patches
* MariaDB-Galera-server-10.0.24-1.el6.x86_64
Thanks in advance!
~tommy