Hello,
I am running MongoDB on EC2 with three hosts in a replica set
configuration. Failover works well if I terminate the EC2 instance,
gracefully stop mongod, or kill -9 the mongod master.
Today my master encountered a known bug in the Linux kernel [1] that
caused the mongod process to hang. Any process that tries to access
mongod's logging directory hangs forever, and it looks like this is
what truly hung the process. The mongod process is still listening on
port 27017, it still has established TCP connections to clients, and
most importantly, it is still actively heartbeating to its peers. When
I connect to a secondary with the shell and run "rs.status()", I see
the hung master's heartbeat incrementing steadily, and it is still in
state 1 (PRIMARY). I forced failover by using iptables to block port
27017 on the hung master. I have this machine available for further
testing if you need more information.
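
To illustrate why "the port is open" told me nothing here, below is a
small Python sketch of the two kinds of external check. It is only an
illustration under my own assumptions: the function names
port_is_open and answers_request are made up, and the request bytes
are a placeholder, not a real MongoDB wire-protocol message. A real
check would issue an actual command through a driver.

```python
import socket


def port_is_open(host, port, timeout_s=1.0):
    """Return True if a TCP connection can be established.

    This succeeds even when the server process is hung, because the
    kernel completes the handshake for a listening socket.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def answers_request(host, port, request, timeout_s=1.0):
    """Return True only if the server sends back at least one byte
    within timeout_s after receiving `request` (placeholder payload).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as s:
            s.sendall(request)
            # recv() raises a timeout error (an OSError) if nothing arrives.
            return len(s.recv(1)) > 0
    except OSError:
        return False
```

Against a process that is listening but hung, the first check is
expected to pass and the second to time out, which matches what I saw
with the hung master.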
I would like to discuss the criteria for failover. This was not a
"halting failure" - the process was alive but non-functional. So it's
a Byzantine failure, and mongod can't practically detect every such
failure. I think that "I cannot perform IO" is a reasonable case to
handle.
Can the master give itself a more thorough health check before issuing
a heartbeat? What about giving it a private, system-internal capped
collection that it reads/writes every time it heartbeats? That
collection could be the oplog or a new collection. The health check
could also use client requests in a similar fashion, but mongod has to
differentiate between client mistakes and internal errors as well as
handle the case when clients are all idle. We also need to make sure
that failover doesn't happen just because IO is slow; otherwise masters
could switch constantly under high IO load or show similar oscillating
behavior. This is not a trivial problem and I am curious what the
mongo developers and users think.
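
Here is a rough Python sketch of the kind of self-check I have in
mind. It is not mongod code. The names io_probe, HealthGate,
timeout_s and max_failures are hypothetical, and the probe writes a
scratch file rather than a capped collection. The idea is that each
heartbeat first runs a small write/read probe with a deadline, and the
node only stops reporting itself healthy after several consecutive
failed probes, so that one slow IO does not trigger failover.

```python
import os
import tempfile
import threading


def io_probe(directory):
    """Write, fsync, and read back a small file in `directory`."""
    path = os.path.join(directory, "health_probe.tmp")
    payload = b"ping"
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        return f.read() == payload


class HealthGate:
    """Decides whether a node may keep reporting itself healthy.

    probe:        callable returning True on success (hypothetical hook)
    timeout_s:    deadline for one probe before it counts as failed
    max_failures: consecutive failures that make the node unhealthy
    """

    def __init__(self, probe, timeout_s=5.0, max_failures=3):
        self.probe = probe
        self.timeout_s = timeout_s
        self.max_failures = max_failures
        self.consecutive_failures = 0
        self._inflight = None

    def _run_probe(self):
        # If the previous probe is still stuck, count a failure and do
        # not pile up more threads behind the same hung IO.
        if self._inflight is not None and self._inflight.is_alive():
            return False
        result = {"ok": False}

        def target():
            try:
                result["ok"] = bool(self.probe())
            except Exception:
                result["ok"] = False

        # Daemon thread: a probe blocked in a hung syscall must not
        # block the caller or keep the process alive at exit.
        t = threading.Thread(target=target, daemon=True)
        self._inflight = t
        t.start()
        t.join(self.timeout_s)
        return (not t.is_alive()) and result["ok"]

    def check(self):
        """Run one probe; return True if the node may still heartbeat."""
        if self._run_probe():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.max_failures


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    gate = HealthGate(lambda: io_probe(scratch), timeout_s=2.0)
    print("may heartbeat:", gate.check())
```

The consecutive-failure threshold is one simple way to damp
oscillation; picking timeout_s and max_failures for a loaded system
is exactly the hard part I am asking about.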
$ uname -a
Linux my_host A 2.6.32-308-ec2 #16-Ubuntu SMP Thu Sep 16 15:25:39 UTC
2010 x86_64 GNU/Linux
$ mongo -version
MongoDB shell version: 1.6.3
[1] https://patchwork.kernel.org/patch/120327/