Hi Mongo Experts,
Recently I've been reading a lot about criticisms of MongoDB related to write loss in some scenarios (for example http://aphyr.com/posts/284-call-me-maybe-mongodb). I wanted to evaluate the scenarios in which MongoDB can lose data (since data loss is both a durability and a consistency concern for us) and see whether there are workarounds for these issues before we use MongoDB in production. Please note that we would like to avoid any data loss for our use cases.
Could you please help me with the following queries? This is a rather long post, but the issues are important and need to be understood/handled, so please bear with its length.
There are two issues that I am concerned about:
1. Loss of writes/data in some scenarios even though the client receives a write confirmation/ack.
2. Missing acknowledgement even though the data is successfully persisted; this may cause the client to believe that the write operation failed when it actually succeeded.
As per my understanding, the following are the different scenarios in which these issues can happen:
1. Missing acknowledgement with a single/stand-alone mongod server:
In the case of a single mongod server (i.e. running without any replica set), we can avoid data loss by enabling journaling on the server side and making clients use the journaled write concern. This causes the write operation to wait (up to a maximum of 33 milliseconds for the journal commit to trigger) and to be acknowledged only after the write has been flushed to the journal. There will be no data loss/rollback once the write makes it to the journal, since mongod can recover from the journal after an immediate crash.

However, even though the data is committed to the journal, the mongod server may get killed before the acknowledgement is sent to the client. In this case the data is persisted, but the client never hears a write confirmation/ack indicating success. Could you please explain how a client could detect/handle this situation? Does getLastError() in the MongoDB API report an error in this case, which may cause the client to falsely believe that the write was not successful? Would the client have to query the database to see whether the write actually went through after such an error, or is there a better alternative?
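To make the question concrete, here is the client-side pattern I currently have in mind (a minimal sketch against the current Java sync driver; the "orders" collection, the connection string, and the client-generated _id are just placeholders for illustration): after any error on a journaled write, read the document back by _id to decide whether the write actually persisted.

import com.mongodb.MongoException;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.types.ObjectId;

public class JournaledWriteCheck {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Journaled write concern: ack only after the write is in the journal.
            MongoCollection<Document> coll = client.getDatabase("test")
                    .getCollection("orders")
                    .withWriteConcern(WriteConcern.JOURNALED);

            // Generate the _id client-side so the write can be verified later.
            ObjectId id = new ObjectId();
            Document doc = new Document("_id", id).append("amount", 42);

            try {
                coll.insertOne(doc); // blocks until the journal commit
                System.out.println("write acknowledged");
            } catch (MongoException e) {
                // The ack was lost or the write failed; we cannot tell which,
                // so read the document back to find out what actually happened.
                boolean persisted = coll.find(Filters.eq("_id", id)).first() != null;
                System.out.println(persisted ? "write persisted, ack was lost"
                                             : "write really failed, safe to retry");
            }
        }
    }
}

Is this read-back-on-error approach the recommended way to handle a lost ack, or is there something better?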
2. Data loss and missing acknowledgement issues in a replica set:
Since we want strict consistency when using replica sets, we will write to and read from the primary member of the replica set. However, it looks like there are cases of data loss even when writing to and reading from the primary. I've considered the following scenarios with different replica set and write concern configurations. Note that I will not list any cases of w=0 here, since such a weak write concern is expected to cause more durability and data loss issues.
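For reference, this is how I plan to pin reads to the primary (a minimal sketch against the current Java sync driver; the hosts, replica set name, and database/collection names are placeholders):

import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class PrimaryPinned {
    public static void main(String[] args) {
        MongoCollection<Document> coll =
                MongoClients.create("mongodb://n1,n2,n3/?replicaSet=rs0")
                        .getDatabase("test")
                        .getCollection("orders")
                        .withReadPreference(ReadPreference.primary()) // read only from the primary
                        .withWriteConcern(WriteConcern.W1.withJournal(true)); // w=1, j=true
        System.out.println(coll.countDocuments()); // this read goes to the primary
    }
}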
2.1 Loss of a write even though it is acknowledged, with w=1, j=true: Consider a replica set with 3 members and a client writing with w=1 and j=true (journaled write concern). Assume the client's write gets recorded in the journal, the client receives an acknowledgement, and the primary fails or is partitioned before the write is replicated to a secondary, after which a new primary is elected. In this case the client's write is lost, since it does not appear on the new primary even though the client received a write confirmation from the old primary (such writes are rolled back when the previous primary rejoins the set as a secondary).

It looks like the only way to reduce the chance of data loss in this scenario is to replicate the write to a majority of the members in the replica set before acknowledging it to the client. This attempts to ensure that the data is on more members in case the primary crashes. However, I am concerned that this will badly affect the performance/latency seen by the client, as well as the overhead on the primary. Moreover, even with the w=majority write concern, there still seem to be data loss cases (mentioned below).
Though real network partitions are rare, it looks like even if the primary is merely slow to respond to heartbeats, the rest of the cluster may misidentify it as dead or partitioned and perform another leader election. When that happens, any writes accepted by the slow leader may be both confirmed and lost. It sounds like this could happen fairly often in practice, even on networks with no real partition. Please let me know if there are any suggestions/workarounds to avoid these data loss issues without affecting performance/latency.
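For reference, here is how I would configure the majority write concern with a timeout so the client at least finds out quickly when replication is lagging (a sketch against the current Java sync driver; the 5-second wtimeout is an arbitrary value I picked):

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.concurrent.TimeUnit;

public class MajorityWrite {
    public static void main(String[] args) {
        // Ack only after a majority of members have the write; stop waiting
        // (note: this does NOT undo the write) after 5 seconds.
        WriteConcern wc = WriteConcern.MAJORITY
                .withJournal(true)
                .withWTimeout(5, TimeUnit.SECONDS);

        MongoCollection<Document> coll =
                MongoClients.create("mongodb://n1,n2,n3/?replicaSet=rs0")
                        .getDatabase("test")
                        .getCollection("orders")
                        .withWriteConcern(wc);

        coll.insertOne(new Document("amount", 42));
    }
}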
2.2 Loss of a write and no acknowledgement received by the client, with w=majority, j=true: Consider a replica set with 3 members and a client writing with w=majority and j=true (journaled write concern). Assume the client's write gets recorded in the journal and the primary fails (either crashes or is partitioned) before the write is replicated to any secondary. In this case the client will not get an acknowledgement, since the write has not been replicated to a majority. A secondary will now take over as primary. If the client/driver now fails over to the newly elected primary, it will have to re-issue the operation, since the last write didn't make it to the new primary and was not ack'ed. When the previous primary rejoins the set as a secondary, it reverts, or "rolls back," its write operations to maintain consistency with the other members. Conclusion: since the client re-issues its operation against the new primary (perhaps based on an error or timeout?), all data ends up in a consistent state. I am not sure whether the Java driver already detects failover transitions and internally re-issues the operation against the new primary. Thoughts?
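In case the driver does not do this internally, here is the kind of application-level retry I am considering (a sketch only; it assumes the write is made idempotent by keying a replace-with-upsert on a client-generated _id, so re-issuing it across a failover is safe):

import com.mongodb.MongoException;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;
import org.bson.types.ObjectId;

public class IdempotentRetry {
    // Replace-with-upsert keyed on a client-generated _id: running this
    // twice has the same effect as running it once, so retrying across a
    // failover cannot create duplicates.
    static void writeWithRetry(MongoCollection<Document> coll,
                               ObjectId id, Document doc, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                coll.replaceOne(Filters.eq("_id", id), doc,
                        new ReplaceOptions().upsert(true));
                return; // acknowledged by the (possibly new) primary
            } catch (MongoException e) {
                // Could be a transient failover error; back off and retry.
                if (attempt == maxAttempts) throw e;
                Thread.sleep(1000L * attempt);
            }
        }
    }
}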
2.3 Client does not get an acknowledgement though the write is persisted, with w=majority, j=true [no data loss but missing acknowledgement]: Consider a replica set with 3 members and a client writing with w=majority and j=true. Assume the client's write gets recorded in the journal and replicated to only some of the secondaries, and then the primary fails before the write can be replicated to a majority. Say one of the secondaries that received the write is now elected as the new primary. In that case the write actually made it to a secondary and succeeded, but the primary failed before the acknowledgement was sent to the client. So the client may see connection issues or a timeout and think the write failed when it actually succeeded. There is no data loss here, but the client never hears a confirmation from the primary. This may be an issue, since the client would try to re-issue the write, believing the previous operation failed. Could you please explain how a client can recover from this situation?
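One recovery idea I can think of (please correct me if there is a better one): if every insert carries a client-generated _id, a blind retry after a lost ack becomes safe, because a duplicate key error on _id proves that the earlier write actually persisted. A sketch against the current Java sync driver:

import com.mongodb.ErrorCategory;
import com.mongodb.MongoWriteException;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.types.ObjectId;

public class RetryAfterLostAck {
    static void insertOrConfirm(MongoCollection<Document> coll, Document docWithId) {
        try {
            coll.insertOne(docWithId);
        } catch (MongoWriteException e) {
            if (e.getError().getCategory() == ErrorCategory.DUPLICATE_KEY) {
                // The _id already exists: the earlier write whose ack we
                // lost did persist, so there is nothing left to do.
                return;
            }
            throw e;
        }
    }

    // Usage: generate the _id before the first attempt and reuse it on retry.
    static Document newOrder(int amount) {
        return new Document("_id", new ObjectId()).append("amount", amount);
    }
}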
2.4 Loss of a write though it is acknowledged to the client, with w=majority, j=true: Consider a replica set with 3 members and a client writing with w=majority and j=true. Assume the client's write gets recorded in the journal, gets replicated to a majority of the secondaries, and the primary acknowledges it to the client. However, a secondary may not have written it to its own journal yet (until its 100-millisecond journal commit timer expires), even though the acknowledgement has been received by the client. If the primary is now partitioned and a majority of the secondaries also restart, the write could be lost, since it hadn't made it to the secondaries' journals yet. [From the MongoDB documentation: "Requiring journaled write concern in a replica set only requires a journal commit of the write operation to the primary of the set regardless of the level of replica acknowledged write concern."] In this case, though the client is acknowledged, the write is lost (although I agree that in practice this is rare). Could you please explain if/how we could avoid the loss of confirmed writes in this scenario?
2.5 Loss of an acknowledged write with w=2: Consider a replica set with 5 members (n1, n2, n3, n4, n5) and a client writing with w=2. The client writes to the primary n1, the write is replicated to the secondary n2 (satisfying the write concern of 2), and the client is acknowledged. Now the primary and this secondary (n1, n2) are separated from the rest of the members (as explained in http://aphyr.com/posts/284-call-me-maybe-mongodb). In this case some other secondary (say n5) that didn't see this write may be elected as the new primary. The write actually made it to n1 and n2 and was acknowledged, but the new primary n5 does not have it. So the client may think the write succeeded, yet this data will be lost/rolled back by the previous primary when it comes back up. There is data loss even though an acknowledgement was received (as discussed in http://aphyr.com/posts/284-call-me-maybe-mongodb).
It would be very helpful if MongoDB experts/developers could comment on how users can avoid or work around such situations. It is very important for our use cases to not lose data, or at least to reliably/quickly detect whether we've lost data.

I am especially worried because posts like http://aphyr.com/posts/284-call-me-maybe-mongodb state: "At the same time, you should watch those rollback files. Sometimes they don't appear even though they're supposed to, and not all data types will actually be rolled back."
Thanks,
Deepak