Hi Mongo Experts,
Recently I've been reading a lot about criticisms of MongoDB related to write loss in some scenarios (for example http://aphyr.com/posts/284-call-me-maybe-mongodb). I wanted to evaluate the scenarios in which MongoDB can lose data (since data loss is both a durability and a consistency concern for us) and see whether there are workarounds for these issues before we use MongoDB in production. Please note that we would like to avoid any data loss for our use cases.
Could you please help me with the following queries? This is a rather long post, but the issues are important and need to be understood/handled, so please bear with its length.
There are two issues that I am concerned about:
1. Loss of writes/data in some scenarios even though the client receives a write confirmation/ack.
2. Missing acknowledgement even though the data is successfully persisted; this may cause the client to believe that the write operation failed when it actually succeeded.
As per my understanding, the following are the different scenarios in which these issues can happen:
1. Missing acknowledgement with a single/stand-alone mongod server:
In the case of a single mongod server (i.e. running without any replica set), we can avoid data loss by enabling journaling on the server side and making clients use the journaled write concern. This causes the write operation to wait (up to a maximum of 33 milliseconds for the journal commit to trigger) and to be acknowledged only after the write has been flushed to the journal. There will be no data loss/rollback once the write makes it to the journal, since mongod can recover from the journal after an immediate crash.

However, even though the data is committed to the journal, the mongod server may get killed before the acknowledgement is sent to the client. In this case the data is persisted, but the client never hears a write confirmation/ack indicating success. Could you please explain how a client could detect/handle this situation? Does getLastError() in the MongoDB API report an error in this case, which may cause the client to falsely believe that the write was not successful? Would the client have to query the database to see whether the write actually went through after such an error, or is there a better alternative?
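To make the question concrete, here is the client-side pattern I currently have in mind (a minimal sketch against the current Java sync driver; the "orders" collection, the connection string, and the client-generated _id are just placeholders for illustration): after any error on a journaled write, read the document back by _id to decide whether the write actually persisted.

import com.mongodb.MongoException;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.types.ObjectId;

public class JournaledWriteCheck {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Journaled write concern: ack only after the write is in the journal.
            MongoCollection<Document> coll = client.getDatabase("test")
                    .getCollection("orders")
                    .withWriteConcern(WriteConcern.JOURNALED);

            // Generate the _id client-side so the write can be verified later.
            ObjectId id = new ObjectId();
            Document doc = new Document("_id", id).append("amount", 42);

            try {
                coll.insertOne(doc); // blocks until the journal commit
                System.out.println("write acknowledged");
            } catch (MongoException e) {
                // The ack was lost or the write failed; we cannot tell which,
                // so read the document back to find out what actually happened.
                boolean persisted = coll.find(Filters.eq("_id", id)).first() != null;
                System.out.println(persisted ? "write persisted, ack was lost"
                                             : "write really failed, safe to retry");
            }
        }
    }
}

Is this read-back-on-error approach the recommended way to handle a lost ack, or is there something better?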
2. Data loss and missing acknowledgement issues in a replica set:
Since we want strict consistency when using replica sets, we will write to and read from the primary member of the replica set. However, it looks like there are cases of data loss even when writing to and reading from the primary. I've considered the following scenarios with different replica set and write concern configurations. Note that I will not list any cases of w=0 here, since such a weak write concern is expected to cause more durability and data loss issues.
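For reference, this is how I plan to pin reads to the primary (a minimal sketch against the current Java sync driver; the hosts, replica set name, and database/collection names are placeholders):

import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class PrimaryPinned {
    public static void main(String[] args) {
        MongoCollection<Document> coll =
                MongoClients.create("mongodb://n1,n2,n3/?replicaSet=rs0")
                        .getDatabase("test")
                        .getCollection("orders")
                        .withReadPreference(ReadPreference.primary()) // read only from the primary
                        .withWriteConcern(WriteConcern.W1.withJournal(true)); // w=1, j=true
        System.out.println(coll.countDocuments()); // this read goes to the primary
    }
}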
2.1 Loss of a write even though it is acknowledged, with w=1, j=true: Consider a replica set with 3 members and a client writing with w=1 and j=true (journaled write concern). Assume the client's write gets recorded in the journal, the client receives an acknowledgement, and the primary fails or is partitioned before the write is replicated to a secondary, after which a new primary is elected. In this case the client's write is lost, since it does not appear on the new primary even though the client received a write confirmation from the old primary (such writes are rolled back when the previous primary rejoins the set as a secondary).

It looks like the only way to reduce the chance of data loss in this scenario is to replicate the write to a majority of the members in the replica set before acknowledging it to the client. This attempts to ensure that the data is on more members in case the primary crashes. However, I am concerned that this will badly affect the performance/latency seen by the client, as well as the overhead on the primary. Moreover, even with the w=majority write concern, there still seem to be data loss cases (mentioned below).
Though real network partitions are rare, it looks like even if the primary is merely slow to respond to heartbeats, the rest of the cluster may misidentify it as dead or partitioned and perform another leader election. When that happens, any writes accepted by the slow leader may be both confirmed and lost. It sounds like this could happen fairly often in practice, even on networks with no real partition. Please let me know if there are any suggestions/workarounds to avoid these data loss issues without affecting performance/latency.
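For reference, here is how I would configure the majority write concern with a timeout so the client at least finds out quickly when replication is lagging (a sketch against the current Java sync driver; the 5-second wtimeout is an arbitrary value I picked):

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.concurrent.TimeUnit;

public class MajorityWrite {
    public static void main(String[] args) {
        // Ack only after a majority of members have the write; stop waiting
        // (note: this does NOT undo the write) after 5 seconds.
        WriteConcern wc = WriteConcern.MAJORITY
                .withJournal(true)
                .withWTimeout(5, TimeUnit.SECONDS);

        MongoCollection<Document> coll =
                MongoClients.create("mongodb://n1,n2,n3/?replicaSet=rs0")
                        .getDatabase("test")
                        .getCollection("orders")
                        .withWriteConcern(wc);

        coll.insertOne(new Document("amount", 42));
    }
}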
2.2 Loss of a write and no acknowledgement received by the client, with w=majority, j=true: Consider a replica set with 3 members and a client writing with w=majority and j=true (journaled write concern). Assume the client's write gets recorded in the journal and the primary fails (either crashes or is partitioned) before the write is replicated to any secondary. In this case the client will not get an acknowledgement, since the write has not been replicated to a majority. A secondary will now take over as primary. If the client/driver now fails over to the newly elected primary, it will have to re-issue the operation, since the last write didn't make it to the new primary and was not ack'ed. When the previous primary rejoins the set as a secondary, it reverts, or "rolls back," its write operations to maintain consistency with the other members. Conclusion: since the client re-issues its operation against the new primary (perhaps based on an error or timeout?), all data ends up in a consistent state. I am not sure whether the Java driver already detects failover transitions and internally re-issues the operation against the new primary. Thoughts?
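In case the driver does not do this internally, here is the kind of application-level retry I am considering (a sketch only; it assumes the write is made idempotent by keying a replace-with-upsert on a client-generated _id, so re-issuing it across a failover is safe):

import com.mongodb.MongoException;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;
import org.bson.types.ObjectId;

public class IdempotentRetry {
    // Replace-with-upsert keyed on a client-generated _id: running this
    // twice has the same effect as running it once, so retrying across a
    // failover cannot create duplicates.
    static void writeWithRetry(MongoCollection<Document> coll,
                               ObjectId id, Document doc, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                coll.replaceOne(Filters.eq("_id", id), doc,
                        new ReplaceOptions().upsert(true));
                return; // acknowledged by the (possibly new) primary
            } catch (MongoException e) {
                // Could be a transient failover error; back off and retry.
                if (attempt == maxAttempts) throw e;
                Thread.sleep(1000L * attempt);
            }
        }
    }
}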
2.3 Client does not get an acknowledgement though the write is persisted, with w=majority, j=true [no data loss but missing acknowledgement]: Consider a replica set with 3 members and a client writing with w=majority and j=true. Assume the client's write gets recorded in the journal and replicated to only some of the secondaries, and then the primary fails before the write can be replicated to a majority. Say one of the secondaries that received the write is now elected as the new primary. In that case the write actually made it to a secondary and succeeded, but the primary failed before the acknowledgement was sent to the client. So the client may see connection issues or a timeout and think the write failed when it actually succeeded. There is no data loss here, but the client never hears a confirmation from the primary. This may be an issue, since the client would try to re-issue the write, believing the previous operation failed. Could you please explain how a client can recover from this situation?
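One recovery idea I can think of (please correct me if there is a better one): if every insert carries a client-generated _id, a blind retry after a lost ack becomes safe, because a duplicate key error on _id proves that the earlier write actually persisted. A sketch against the current Java sync driver:

import com.mongodb.ErrorCategory;
import com.mongodb.MongoWriteException;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.types.ObjectId;

public class RetryAfterLostAck {
    static void insertOrConfirm(MongoCollection<Document> coll, Document docWithId) {
        try {
            coll.insertOne(docWithId);
        } catch (MongoWriteException e) {
            if (e.getError().getCategory() == ErrorCategory.DUPLICATE_KEY) {
                // The _id already exists: the earlier write whose ack we
                // lost did persist, so there is nothing left to do.
                return;
            }
            throw e;
        }
    }

    // Usage: generate the _id before the first attempt and reuse it on retry.
    static Document newOrder(int amount) {
        return new Document("_id", new ObjectId()).append("amount", amount);
    }
}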
2.4 Loss of a write though it is acknowledged to the client, with w=majority, j=true: Consider a replica set with 3 members and a client writing with w=majority and j=true. Assume the client's write gets recorded in the journal, gets replicated to a majority of the secondaries, and the primary acknowledges it to the client. However, a secondary may not have written it to its own journal yet (until its 100-millisecond journal commit timer expires), even though the acknowledgement has been received by the client. If the primary is now partitioned and a majority of the secondaries also restart, the write could be lost, since it hadn't made it to the secondaries' journals yet. [From the MongoDB documentation: "Requiring journaled write concern in a replica set only requires a journal commit of the write operation to the primary of the set regardless of the level of replica acknowledged write concern."] In this case, though the client is acknowledged, the write is lost (although I agree that in practice this is rare). Could you please explain if/how we could avoid the loss of confirmed writes in this scenario?
2.5 Loss of an acknowledged write with w=2: Consider a replica set with 5 members (n1, n2, n3, n4, n5) and a client writing with w=2. The client writes to the primary n1, the write is replicated to the secondary n2 (satisfying the write concern of 2), and the client is acknowledged. Now the primary and this secondary (n1, n2) are separated from the rest of the members (as explained in http://aphyr.com/posts/284-call-me-maybe-mongodb). In this case some other secondary (say n5) that didn't see this write may be elected as the new primary. The write actually made it to n1 and n2 and was acknowledged, but the new primary n5 does not have it. So the client may think the write succeeded, yet this data will be lost/rolled back by the previous primary when it comes back up. There is data loss even though an acknowledgement was received (as discussed in http://aphyr.com/posts/284-call-me-maybe-mongodb).
It would be very helpful if MongoDB experts/developers could comment on how users can avoid or work around such situations. It is very important for our use cases to not lose data, or at least to reliably/quickly detect whether we've lost data.

I am especially worried because posts like http://aphyr.com/posts/284-call-me-maybe-mongodb state: "At the same time, you should watch those rollback files. Sometimes they don't appear even though they're supposed to, and not all data types will actually be rolled back."
Thanks,
Deepak