Cluster node fencing.

Simone Gotti

Feb 13, 2015, 5:01:20 AM
to coreo...@googlegroups.com
Hi,

I've had these questions and possible ideas for a long time, and I'd
like to share them and get some feedback. I searched for issues on
flannel and kubernetes but didn't find anything really similar to this.

Basically, since flannel or kubernetes can/will be used to manage
containers that have persistent data (typical example: a database), and
this data should be accessed by only one container at a time (to avoid
data corruption), there's a need to avoid starting multiple containers
that access the same data.

One possible example, to keep things simple and clean, is a CoreOS
cluster where all nodes see the same data (Ceph RBD, CephFS, GlusterFS,
etc.), and flannel starts a database container on node A that mounts its
data directory from the shared storage. If node A then dies/panics/loses
network connectivity, flannel will currently start the container on
another machine without any check to verify that the previous container
is really down. This is a typical split-brain event.

At the moment there are various ways to avoid this. For example, just
avoid shared storage and tie the container to a single machine. Then, to
get high availability, use the database's replication features with
other containers tied to other nodes as replicas and, as a plus,
automate master/slave elections with various tools (for example
redis-sentinel).


Another solution would be to make the "cluster" manager able to really
isolate the "failed" node hosting the containers before starting another
one on a new node. This is the basic job of a "classic" HA cluster:
fencing (the ability to isolate a node).

Now I have various doubts and questions:

*) Does this make sense in a "container" world where containers can
have persistent data (and become stateful)? Or are there better ways to
handle this?

*) If so, where should it be implemented?
For example, on a CoreOS cluster there can be two semi-independent
cluster managers: flannel and kubernetes (semi-independent since
kubernetes can be launched by flannel) with different logic and
requirements that will clash if both implement some sort of fencing.

My initial idea would be to implement a "fencing" service that fences
nodes (in multiple possible ways) when it detects they are unreachable,
plus an API that cluster managers like flannel and kubernetes can use to
check the node/minion state before initiating other operations (like
starting a container on another node).
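
Just to make the idea concrete, here is a rough Go sketch of what such
an API could look like. Everything below is hypothetical (these are not
the go-fence types):

// Hypothetical interface a fencing service could expose to cluster
// managers. All names are illustrative only.
package fencing

type NodeState int

const (
    NodeHealthy NodeState = iota // node responds and is considered safe
    NodeUnknown                  // node unreachable, fencing not yet confirmed
    NodeFenced                   // node confirmed isolated (powered off, switch port down, ...)
)

// Fencer is what a cluster manager would consult before rescheduling a
// stateful container onto another node.
type Fencer interface {
    // State reports the last known state of a node.
    State(nodeID string) (NodeState, error)
    // Fence tries to isolate the node and returns only when a fencing
    // agent confirms the isolation (or an error/timeout occurs).
    Fence(nodeID string) error
}

The cluster manager would restart the stateful container elsewhere only
after State reports NodeFenced for the old node.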

There may be other and better ideas and solutions, and probably a lot of
problems to be addressed, but for the moment I want to stop here and
hear your thoughts.

FYI, some months ago I started writing a basic fencing library, which
you can find here (https://github.com/go-fence/fence), with the idea of
using it for these needs.

Thanks!

Simone



Simone Gotti

Feb 13, 2015, 2:40:49 PM
to coreo...@googlegroups.com
I really shouldn't write emails at night while working with flannel...
resending the previous one with a big s/flannel/fleet/g. Sorry for the confusion...

Hi,

I've had these questions and possible ideas for a long time, and I'd
like to share them and get some feedback. I searched for issues on
fleet and kubernetes but didn't find anything really similar to this.

Basically, since fleet or kubernetes can/will be used to manage
containers that have persistent data (typical example: a database), and
this data should be accessed by only one container at a time (to avoid
data corruption), there's a need to avoid starting multiple containers
that access the same data.

One possible example, to keep things simple and clean, is a CoreOS
cluster where all nodes see the same data (Ceph RBD, CephFS, GlusterFS,
etc.), and fleet starts a database container on node A that mounts its
data directory from the shared storage. If node A then dies/panics/loses
network connectivity, fleet will currently start the container on
another machine without any check to verify that the previous container
is really down. This is a typical split-brain event.

At the moment there are various ways to avoid this. For example, just
avoid shared storage and tie the container to a single machine. Then, to
get high availability, use the database's replication features with
other containers tied to other nodes as replicas and, as a plus,
automate master/slave elections with various tools (for example
redis-sentinel).


Another solution would be to make the "cluster" manager able to really
isolate the "failed" node hosting the containers before starting another
one on a new node. This is the basic job of a "classic" HA cluster:
fencing (the ability to isolate a node).

Now I have various doubts and questions:

*) Does this make sense in a "container" world where containers can
have persistent data (and become stateful)? Or are there better ways to
handle this?

*) If so, where should it be implemented?
For example, on a CoreOS cluster there can be two semi-independent
cluster managers: fleet and kubernetes (semi-independent since
kubernetes can be launched by fleet) with different logic and
requirements that will clash if both implement some sort of fencing.

My initial idea would be to implement a "fencing" service that fences
nodes (in multiple possible ways) when it detects they are unreachable,
plus an API that cluster managers like fleet and kubernetes can use to
check the node/minion state before initiating other operations (like
starting a container on another node).

Jonathan Boulle

Feb 21, 2015, 11:19:42 PM
to coreo...@googlegroups.com
belated driveby..


On Fri, Feb 13, 2015 at 11:40 AM, Simone Gotti <simone...@gmail.com> wrote:
fleet will start the container on another machine without any check to
verify that the previous container is really down. This is a typical
split-brain event.

Right, a system like fleet tries to provide an "at least once" SLA.

*) Does this make sense in a "container" world where containers can
have persistent data (and become stateful)? Or are there better ways to
handle this?

Generally the idea is to push this into a dedicated shared persistent service: Gluster, Ceph, etc...

Fencing is kind of a nasty business to get into. Why not have your applications do their own leader election directly (using etcd for example) instead?
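
For illustration, here is a minimal sketch of leader election against
etcd's v2 HTTP API: whoever manages to atomically create the key is the
leader. The endpoint, key name and TTL below are just placeholders:

// Sketch only: try to become leader by atomically creating a key in
// etcd (v2 HTTP API). If the key already exists, someone else leads.
package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

func tryAcquireLeadership(etcdURL, key, nodeID string, ttlSeconds int) (bool, error) {
    form := url.Values{}
    form.Set("value", nodeID)
    form.Set("ttl", fmt.Sprintf("%d", ttlSeconds))
    form.Set("prevExist", "false") // atomic create: fails if the key already exists

    req, err := http.NewRequest("PUT", etcdURL+"/v2/keys/"+key, strings.NewReader(form.Encode()))
    if err != nil {
        return false, err
    }
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    switch resp.StatusCode {
    case http.StatusCreated: // 201: we created the key, we are the leader
        return true, nil
    case http.StatusPreconditionFailed: // 412: key exists, someone else leads
        return false, nil
    default:
        return false, fmt.Errorf("unexpected status %d", resp.StatusCode)
    }
}

func main() {
    leader, err := tryAcquireLeadership("http://127.0.0.1:2379", "db-leader", "node-a", 30)
    if err != nil {
        panic(err)
    }
    fmt.Println("leader:", leader)
}

The leader has to keep refreshing the key before the TTL expires; if the
node dies, the key times out and another node can take over.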

Simone Gotti

Mar 11, 2015, 1:18:26 PM
to coreo...@googlegroups.com
On Sat, 2015-02-21 at 20:19 -0800, Jonathan Boulle wrote:
> belated driveby..

Hi Jonathan,

>
>
> On Fri, Feb 13, 2015 at 11:40 AM, Simone Gotti
> <simone...@gmail.com> wrote:
> fleet will start the container on another machine without any
> check to verify that the previous container is really down.
> This is a typical split-brain event.
>
>
> Right, a system like fleet tries to provide an "at least once" SLA.
>
>
> *) Does this make sense in a "container" world where containers
> can have persistent data (and become stateful)? Or are there
> better ways to handle this?
>
> Generally the idea is to push this into a dedicated shared persistent
> service: Gluster, Ceph, etc...

Yes. For now I can think of using the locking mechanism provided
directly by the shared storage (like POSIX locks, assuming the shared-fs
client correctly handles lost locks/leases and avoids the data
corruption that happened in older versions of the NFSv4 Linux client).
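
For example, a rough sketch of taking a POSIX (fcntl) lock on a lock
file that lives on the shared storage; the path is made up, and this of
course only helps if the filesystem client really enforces and revokes
locks correctly:

// Sketch: take an exclusive POSIX (fcntl) write lock on a lock file on
// the shared storage, so only one container at a time can hold it.
// The lock path is illustrative; error handling is minimal.
package main

import (
    "fmt"
    "os"
    "syscall"
)

func acquireStorageLock(path string) (*os.File, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0600)
    if err != nil {
        return nil, err
    }

    lock := syscall.Flock_t{
        Type:   syscall.F_WRLCK, // exclusive write lock
        Whence: 0,               // offsets relative to start of file
        Start:  0,
        Len:    0, // 0 means "lock the whole file"
    }
    // F_SETLK is non-blocking: it fails immediately if another process
    // (possibly on another node, through the shared fs) holds the lock.
    if err := syscall.FcntlFlock(f.Fd(), syscall.F_SETLK, &lock); err != nil {
        f.Close()
        return nil, fmt.Errorf("lock held elsewhere or not supported: %v", err)
    }
    return f, nil // keep f open: closing it releases the lock
}

func main() {
    f, err := acquireStorageLock("/mnt/shared/db/.lock")
    if err != nil {
        fmt.Println("not starting the database:", err)
        os.Exit(1)
    }
    defer f.Close()
    fmt.Println("lock acquired, safe to start the database")
}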

Perhaps there are other possible solutions I haven't thought of.

>
>
> Fencing is kind of a nasty business to get into. Why not have your
> applications do their own leader election directly (using etcd for
> example) instead?
>

Really nasty business. :D I'm all for approaches like data replication
and externally controlled failover that avoid any shared storage.

Assuming I'm stubborn and want to find, at all costs, a solution for
running in a highly available fashion products that require exclusive
access to shared storage, which in a non-container world would be
controlled by a clustering solution (like Pacemaker, using fencing to
kill unresponsive machines) together with some kind of shared storage
(since the apps need exclusive access to the storage, HA-LVM/clvmd plus
a locally mounted filesystem is enough, but one can also use a global
filesystem, or Ceph, or GlusterFS, etc.):


*) Leader election: using etcd for leader election in the app is the
right way when you can write your app and it doesn't need exclusive
access to shared storage. I think of it as a sort of self-fencing, since
the app cooperatively blocks itself from doing operations if it's not
the leader.
But if the app needs exclusive access to shared storage, even if it
checks that it's the leader before every write, there can always be a
time window (due to slow writes, frozen processes, etc.) in which an app
that has lost its leadership writes to the storage after the new leader
has already started writing (causing data corruption), as sketched
below.
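
Something like this contrived example, where isLeader and writeBlock
are hypothetical helpers, just for illustration:

package main

import "fmt"

// Illustration of the check-then-write race: leadership can be lost
// between the check and the write, e.g. if the process is paused.
func replicateOnce(isLeader func() bool, writeBlock func([]byte) error, data []byte) error {
    if !isLeader() {
        return nil // not the leader: politely step aside
    }
    // <-- If the process freezes here (GC pause, swapping, slow I/O),
    //     the lease can expire, a new leader is elected and starts
    //     writing, and then this stale write still lands on the
    //     shared storage.
    return writeBlock(data)
}

func main() {
    // Toy wiring just to make the sketch compile and run.
    err := replicateOnce(
        func() bool { return true },
        func(b []byte) error { fmt.Printf("writing %d bytes\n", len(b)); return nil },
        []byte("hello"),
    )
    fmt.Println("err:", err)
}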

So self-fencing alone is not a valid approach; real fencing is needed.
For this reason I was thinking that the fencing needs to be
done/coordinated by the cluster manager (fleet/kubernetes, like
Pacemaker does).

Now the question is: is fencing wanted/needed, or should it be avoided
by moving to alternative solutions that don't require it (like
replication, storage locking, etc.)?


Thanks!

Accela Zhao

Aug 5, 2015, 3:49:33 AM
to CoreOS Dev
Agreed with Simone. Even if we have leader election in the app itself, it is possible that the old container is still writing (but unresponsive) while the new container gets elected as leader.

If we choose to use a distributed lock, or a lock on a shared FS, there is also a problem: the newly launched app container has to wait until the old one's lock times out before it can take over and write. This creates a downtime window.

Also, many apps already exist. It is hard to modify their internals, but we still have to ship them onto fleet or k8s.

So, I'm looking forward to a fencing solution provided by the cluster manager.

Brandon Philips

Aug 5, 2015, 6:59:45 PM
to CoreOS Dev
I think Kubernetes is the best place to put this sort of functionality. I could imagine adding things to the kube service proxy to fence off workloads that have been unscheduled so they can no longer talk. It would be good to file an issue with some use cases here: https://github.com/GoogleCloudPlatform/kubernetes/issues

Accela Zhao

Aug 6, 2015, 6:39:07 AM
to CoreOS Dev