One node lost storage, took down cluster


Tommy McNeely

Apr 21, 2016, 2:12:59 PM
to codership
We have a situation where one of the Galera Cluster VMs lost its storage because the network link to the iSCSI backend was broken. MySQL connections started backing up because the node couldn't write to disk, and eventually the cluster went into the "INITIALIZED" state (i.e. down). It is possible that one of the other DB admins helped propagate the problem: he started trying to shut down the systems he perceived as stalled (but he couldn't log in to the node that had lost its storage). We do not have any STONITH (Shoot The Other Node In The Head) capability set up (and I'm not even sure whether Galera supports that).

So I have one node in a five-node cluster that is "running" but has no storage and is effectively hung. I cannot log in and shut it down, short of doing a "power off" operation in VMware, which is what we ended up having to do to recover.

What would be the proper course of action to get the cluster to "survive" with minimal downtime?

Is there some command I can issue to make the other nodes "blacklist" the bad node? Is there some configuration that could handle that automatically (though that seems a bit dangerous)?

Has anyone else experienced a similar situation and come up with a solution?


Environment:
* VMware 5.5.x (multiple hosts, with affinity rules set up to prevent multiple Galera cluster nodes from occupying the same host)
* Nimble Storage (iSCSI)
* CentOS 6.7 - latest patches
* MariaDB-Galera-server-10.0.24-1.el6.x86_64

Thanks in advance!
~tommy

Andrew Garner

Apr 22, 2016, 5:50:52 PM
to Tommy McNeely, codership
You can use evs.evict to manually evict a node by UUID:

http://galeracluster.com/documentation-webpages/galeraparameters.html#evs-evict

For example:

-- c1614b15-0834-11e6-ba55-620dbb1bb1f0 maps to my 'db02'
SET GLOBAL wsrep_provider_options = 'evs.evict=c1614b15-0834-11e6-ba55-620dbb1bb1f0';
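
If you're not sure of the UUID, the EVS status variables can help you find it and confirm the eviction afterwards (a sketch; these are standard Galera status variables, though I haven't checked them against a stalled-I/O node specifically):

-- Delayed nodes (eviction candidates) are listed here by UUID
SHOW GLOBAL STATUS LIKE 'wsrep_evs_delayed';
-- After the eviction, the UUID should show up in the evict list
SHOW GLOBAL STATUS LIKE 'wsrep_evs_evict_list';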

You might be able to use auto-eviction for this as well, but I have not tested that at all with stalled I/O. I will certainly do so after reading about your experience. :)

http://galeracluster.com/documentation-webpages/autoeviction.html
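
If you try it, auto-eviction is switched on through the same provider options; a sketch, where the threshold of 5 delayed-list entries is just an example value:

-- Evict a node automatically once it accumulates 5 entries in the
-- delayed list (evs.delayed_margin controls what counts as delayed);
-- 0, the default, disables auto-eviction.
SET GLOBAL wsrep_provider_options = 'evs.auto_evict=5';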

~Andrew

alexey.y...@galeracluster.com

May 2, 2016, 5:50:41 PM
to Andrew Garner, Tommy McNeely, codership
Auto-eviction only works for nodes that have a poor network connection
to the cluster.

In this case the node seemed to have no connectivity problems with the
other members. And since plain MySQL can't detect stalled I/O, neither
can a Galera node, so nothing can be automated here.

All that had to be done there was to kill -9 the mysqld process (why try
a clean shutdown if it can't write to storage anyway?), or kill the VM, or
somehow sever the connection between the node and the rest of the cluster.
Imagination is the limit ;)
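
For example, if the stalled node won't let you log in, you can cut it off from the healthy side instead; a sketch with iptables, assuming 10.0.0.5 is the stalled node's address and the default Galera group-communication port 4567 (run on every other node):

# Block all Galera replication traffic to and from the stalled node
iptables -A INPUT -p tcp -s 10.0.0.5 --dport 4567 -j DROP
iptables -A OUTPUT -p tcp -d 10.0.0.5 --dport 4567 -j DROP

Once the remaining nodes lose contact with it, they should drop it from the membership and re-form the primary component on their own.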

Tommy McNeely

May 3, 2016, 12:28:50 AM
to alexey.y...@galeracluster.com, Andrew Garner, codership
Hi Alexey,

We could not log in to shut it down or unceremoniously kill it (-9). I am not sure if that's a function of using IPA for authentication or if it's common to be unable to authenticate once storage has been lost. We have experienced a few storage issues in the past, and this was similar: we could not log in at all.

We did end up having to kill the VM itself using the virtual "reset" button in VMware, but by then the cluster had been down for a while. We were just trying to figure out a way, should this ever happen again (and we hope it doesn't), to keep the cluster online and not have it drop into the "INITIALIZED" state.

~tommy


Tommy McNeely
IT Architect
Lark Information Technology, Inc.
