Stale reads from MongoDB 3.0 (still?), a bit of a disheartening post


s.molinari

Apr 24, 2015, 2:00:55 AM
to mongod...@googlegroups.com
I read this blog post yesterday and also the bug posted by the author and I'd like to make sure I understand the issue correctly. 

If you have a replica set, and a network partition splits your replica set nodes, and a new primary is elected...

then.....

a client could read stale data from the new primary, even if write concern was set to majority? 

It seems to me there must also be other conditions in the timing of the reads to make this possible (like really fast activity?). Still, it wasn't denied that the issue is there. It may be a rare condition to hit, but at the same time, reading stale mission-critical data is a no-go.

There are also other "people" jumping into the comments of that post saying their database doesn't have this issue. I thought a database can't be distributed and cover all three parts of CAP anyway. I always knew a new primary might have lost data, but I never considered stale data, which is much worse. I'd much rather have the application balk because expected data is missing than have wrong data enter the application. That could be devastating.

The other thing I am uncertain about is a situation discussed in the bug by MongoDB staff: that there could even be two primaries at one point. Would a client transact with two different primaries at the same time? That is a split-brain condition, isn't it? That would be the worst issue of all, if it is possible. Can split brain happen to a Mongo replica set?

If I want to maintain consistency for very mission-critical data until (if and when) the read consistency issue gets fixed, I could do it on the application side. What would be the best method to do so?

On a side note, I am not at all experienced with networking issues and was hoping a network partition splitting up a replica set would be a very, very rare occasion, but it seems, from the poll I ran in the G+ Mongo community, that unfortunately it isn't. It isn't a large sample, but if it is representative, this issue can't be ignored.

My intention with this post is not to knock MongoDB, but to learn. I am still a MongoDB fan. Just wanted to make sure that is clear. :-)

Scott

Asya Kamsky

Apr 24, 2015, 3:23:39 AM
to mongod...@googlegroups.com
Hi Scott,

I'm going to try to find the time to write up a more detailed response, but for now, a shorter one:

1. The issues being discussed are all related to distributed systems.  Someone saying their non-distributed DB doesn't have this problem is probably missing this point.  
2. This issue is a result of having automatic failover in a distributed replica set.  The nature of networks is that information exchange cannot be instantaneous.  This means there can be scenarios where (momentarily) there can be two primaries (new one is elected before the old one realizes it should step down).
3. The issue of stale reads is complex, and probably both MongoDB drivers and applications can take steps to minimize or eliminate the possibility of it. Kyle describes one potential workaround - one that queue implementations in MongoDB normally use - using a write for a read, i.e. findAndModify, where satisfying a majority writeConcern guarantees that the node was a legitimate primary when it read the data it delivered to you. 3.2 will have a number of improvements in the server, and hopefully in the drivers, which will minimize the scenarios that applications have to handle separately.

Note a correction though - the issue of a client reading stale data from the *new* primary does not exist - that's the issue of an uncommitted read from the old primary. When read-committed is available (in 3.2), the remaining issue will be the possibility of a stale read from the *old* primary, i.e. a read from a primary that hasn't yet figured out it should step down, *after* some new data has been written on the "new" primary. Hope that clarifies things rather than muddying them more :)

Asya
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.
 
For other MongoDB technical support options, see: http://www.mongodb.org/about/support/.
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user...@googlegroups.com.
To post to this group, send email to mongod...@googlegroups.com.
Visit this group at http://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/18a65908-7c6d-4be4-994e-81cc79ec9172%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
MongoDB World is back! June 1-2 in NYC. Use code ASYA for 25% off!

s.molinari

Apr 24, 2015, 8:32:28 AM
to mongod...@googlegroups.com
Thanks Asya. 

I think I understand. But just to be sure. :-)

The issue is now...

When you have a replica set, and a network partition splits your replica set nodes, and until a new primary is elected...

...then.....

a client could read stale data from the old primary, despite a write concern set to majority.

And when 3.2 comes out, the stale reads will only happen when the new primary is running, before the old primary has stepped down.

Is that now more correct?

Scott

Asya Kamsky

Apr 24, 2015, 10:23:08 AM
to mongod...@googlegroups.com
Nope.

Think of it this way - stale reads from the primary are not possible if there is at most one primary at any given time.

This is because you cannot have stale data until someone successfully writes a "newer" version of that data.  And to "write majority", you _must_ be talking to a "true" primary.  So to read stale data you must have a write-read sequence that goes (in order) new primary, then old primary.

Assuming you're with me so far, you may be asking in what scenario you could write to the new primary and read from the old primary. It requires that two primaries exist: the new primary has started accepting writes, but the old primary has not yet realized that it cannot see a majority and therefore must step down.  It also requires "you" to flip-flop between the old, then the new, then the old primary again.  Why would that happen?  It requires a network partition of the kind where the old and new primaries cannot "see" each other, but "you" can see them both.
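The scenario above can be sketched as a toy model (plain Python, all names invented - this is not MongoDB code, just an illustration of the ordering):

```python
# Toy model of the partition scenario: the old primary has not yet
# stepped down, so two nodes both believe they are primary at once.
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {"x": 1}      # value committed before the partition
        self.is_primary = False

old_primary, new_primary = Node("old"), Node("new")
old_primary.is_primary = True

# Partition: a majority elects new_primary, but the step-down signal
# cannot reach old_primary, so both now report themselves as primary.
new_primary.is_primary = True
assert old_primary.is_primary and new_primary.is_primary  # split brain

# "You" can see both sides. A w:majority write succeeds against the
# new primary (it has the majority behind it)...
new_primary.data["x"] = 2

# ...but a read over another connection lands on the old primary,
# which happily answers because it still thinks it is primary:
stale_value = old_primary.data["x"]
assert stale_value == 1   # stale: the majority already accepted x=2
```

The flip-flop is exactly the write to `new_primary` followed by the read from `old_primary`; neither node did anything wrong from its own point of view.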

The reason I put "you" in quotes is that "you" does not represent a point on the network, or a socket, or a connection to the replica set. It represents your application, which can be large and complex, and which goes through the driver - which has connection pools and is monitoring the replica set members, though of course its observations and actions are not instantaneous.  So two different operations could go over two connections, or even two drivers, if your application comprises many parts that are actually smaller separate applications.  All of that makes it easier to see how this can happen, but none of it is required - a single application with more than one connection can create this edge case (you would just have to try harder :).  This is what Kyle's test suites do - they test exactly these scenarios by creating the right conditions for them - but of course they also happen in real life all on their own.

Asya


MARK CALLAGHAN

Apr 24, 2015, 11:39:41 AM
to mongod...@googlegroups.com
I think this discussion can be much simpler.

1) With mmapv1, changes are visible on the master before the journal is forced to disk. That is not true for WiredTiger and RocksDB.

2) With any engine, changes can be visible on the master before the oplog entries are sent to the slave. This is what you get with async replication, and the majority write concern has no impact on this behavior.

---

This sequence can happen:
* I commit change on master
* you read my change on master
* master disappears before oplog entry from my change sent to any slave
* my change disappears because the new master doesn't have it; the change you read no longer exists
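The sequence above can be sketched as a toy model (plain Python, names invented) of async replication: the change is visible on the master before its oplog entry reaches the slave, so promoting the slave erases a value that was already read.

```python
# Toy model: with async replication, a change is applied (and visible)
# on the master before its oplog entry ships to the slave, regardless
# of any write concern the writer asked for.
def apply(oplog):
    """Replay an oplog into a key-value state."""
    state = {}
    for op, key, value in oplog:
        if op == "set":
            state[key] = value
    return state

master_oplog = [("set", "x", 1)]          # already replicated
slave_oplog = [("set", "x", 1)]

master_oplog.append(("set", "x", 2))      # I commit a change on the master
you_read = apply(master_oplog)["x"]       # you read my change: x == 2

# The master disappears before the new entry reaches any slave; the
# slave is promoted, and the change you read no longer exists anywhere.
new_master_state = apply(slave_oplog)
assert you_read == 2
assert new_master_state["x"] == 1
```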

---

I suggested this race existed last year but was ignoring replication at the time and didn't push on that part of it:
http://smalldatum.blogspot.com/2014/03/when-does-mongodb-make-transaction.html

I continue to think this is a doc bug:
http://smalldatum.blogspot.com/2015/04/much-ado-about-nothing.html




--
Mark Callaghan
mdca...@gmail.com

Asya Kamsky

Apr 24, 2015, 12:40:41 PM
to mongodb-user
Mark,

I agree that yours is a great, concise summary of the read-uncommitted issue (which is what the initial discussion was about). 3.2 will add the ability to read only replica-committed data.

However, Kyle's point in the server ticket was that write-majority plus read-replica-committed alone is not enough to guarantee that stale reads cannot happen (which is a required condition for linearizable systems), and that's the scenario I was trying to explain to Scott (since it seemed from his original question that he was already aware that data read on the primary may be missing from a secondary after failover).

Asya

s.molinari

Apr 25, 2015, 2:27:29 AM
to mongod...@googlegroups.com
The following might either show I am still confused or actually understand the issue. 

The issue boils down to latency in the network and, as Mark points out, the asynchronous (optimistic) replication Mongo uses. Because Mongo is distributed, things happen at different times, like writing data to different replica nodes so it is "safe", or telling other nodes and the clients there is a new master. Because of this mixed timing, system messages like "I am now master" and data messages like "I am now saved on this node" can get sent at conflicting times in certain unwanted scenarios, like a network partition, which can cause inconsistencies in the data and the system.

Currently, MongoDB allows reads of data that doesn't yet have the flag "I am now saved on a majority of nodes", and because of the network timing, a client could end up reading from two primaries if a replica set is split by a network partition. If the new primary also has updated data, this could mean reads from the old primary are stale.

So, basically, the read-committed change coming in 3.2 removes the issue that a primary currently isn't concerned with the "I am now saved on a majority of nodes" message and allows data not yet acknowledged as "majority saved" to be read by clients. In other words, the change won't allow reads of a document on the current/old primary until "this data is now saved on a majority of nodes, you may continue with reads" is true. This should help with the stale data and lost data issues in the case of a network partition: for the write, if the confirmation of the majority save isn't there, the application can't move forward (as it is now), and as long as the confirmation isn't there, no other client can read that new data (which currently isn't the case and should be changed). It is basically a "document read lock for write concern majority operations".

This also means a possible performance hit for reads of any such mission-critical data being written to a majority of replica nodes, where such a majority write concern would be needed.

Phew....:-D

Scott  

Asya Kamsky

Apr 25, 2015, 8:15:40 PM
to mongodb-user
Scott,

This is really close, except for the last part about trying to read from the old primary - read committed won't prevent reading stale data from the old primary - you could in fact read the last piece of data that the old primary knows was committed to a majority of the cluster.  However, this may be *older* than the latest data on the newly elected primary.  The fully correct solution is a stronger read consistency option where you can say "I only want to read from you if you verify that you are still the current primary" - that's what you already get if you use a write in place of a read with majority write concern.  If you succeed in replicating this write to a majority of the cluster, then the value read is guaranteed to have been "current" at the time you asked for it (of course it might not be by the time you receive it over the network, but that's inevitable and expected).

Asya



s.molinari

Apr 26, 2015, 2:55:16 AM
to mongod...@googlegroups.com
Now I have it. I think. :o LOL! Thank you for your patience and help.

So, what is the maximum amount of time, worst case, during which a client could read stale data when a network partition happens and a new master is elected? How rare is this scenario?

Can stale reads be defended against at the application level? If yes, how best to do it?

Scott


Asya Kamsky

Apr 26, 2015, 8:37:38 AM
to mongodb-user

Scott:

> what is the maximum amount of time

In practical terms, it will depend on the version of mongod and the driver.  Elections and step-downs are tied to various changes we've made over versions.  We've been working toward shorter failover times and a better election protocol, which will allow the drivers to recognize immediately, in the face of two primaries, which one is "legit".  So if you're using the next version with read committed and the latest driver, and you are talking about a single application (a single connection pool, really, where the driver can invalidate the old primary on all connections as soon as it sees a higher election id), then it could probably be reduced to close to no time.  But that may not be the scenario for large, complex applications - the driver cannot invalidate *other* connections to the old primary; only *it* can do that, based on realizing it's not primary. [2] So in that case we may be talking about seconds.

> How rare is this scenario

Hard to say.  Kyle has an excellent post on "how rare are network partitions" and I'm inclined to point you at it, as it's a pretty fun read. [1] Having said that, I'm not sure it's correct to limit your thinking to network partitions only.  Two primaries can transiently exist any time there is an election by a majority while the existing primary has not stepped down.  This can happen when the network truly partitions off part of the replica set, but it can also happen when the primary is not responding to heartbeats for any other reason.  Possible reasons: its server is overloaded, its server is frozen, someone mis-configured the network switch asymmetrically (no one in the replica set can see the primary but it can see everyone), someone accidentally sent SIGTSTP to the primary process, etc.

> how best to [ defend against it on application level ]

In certain scenarios you use findAndModify to do your reads - which is what you already do when implementing a queue or a "global" sequential counter.  As long as you wait to use the returned value until the majority write concern has been satisfied, you are guaranteed that this primary gave you a non-stale value.
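A toy model (plain Python, invented names - not driver code) of why the read-as-write trick works: the "read" only returns after gathering a majority of acknowledgements, and a partitioned-off old primary can never gather one, so it can never hand you a stale value you would actually use.

```python
# Toy model of read-as-write: treat a read as a findAndModify with
# w:majority. A primary cut off from the majority cannot satisfy the
# write concern, so its (possibly stale) value is never returned.
class Node:
    def __init__(self, value, reachable_peers, cluster_size):
        self.value = value
        self.reachable_peers = reachable_peers  # peers that will ack us
        self.cluster_size = cluster_size

def find_and_modify_read(node):
    """Return the node's value only if a write gathers majority acks."""
    acks = 1 + node.reachable_peers          # self plus reachable peers
    if acks <= node.cluster_size // 2:       # strict majority required
        raise TimeoutError("majority write concern not satisfied")
    return node.value

# A 5-node set, partitioned: the old primary can reach 1 peer,
# the new primary can reach 3.
old_primary = Node(value="stale", reachable_peers=1, cluster_size=5)
new_primary = Node(value="fresh", reachable_peers=3, cluster_size=5)

assert find_and_modify_read(new_primary) == "fresh"
try:
    find_and_modify_read(old_primary)        # stale value never returned
except TimeoutError:
    pass                                     # application waits/retries
```

The trade-off is that every such "read" is a write (with its replication cost), which is why this is reserved for mission-critical values rather than used everywhere.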

I'm sure there will be others in the future - read-as-write can be used with any version.

A non-workaround that nonetheless "works" would be to consider whether preventing stale reads is more important for a particular application than having high availability.  If you think yes, then you can disable automatic failover (and assume that manual failover will never do the wrong thing, which I personally would not).  I'm inclined to think that, given users are always asking for faster failover, lower-latency reads/writes, multi-master, etc. - all of which are more likely to create weaker consistency - either those users' applications are not concerned with stale reads, or . . .

Asya

[1] https://aphyr.com/posts/288-the-network-is-reliable
[2] maybe someone can figure out a way any driver that sees a higher election id can tell the old primary it needs to check if it's really still primary.  Like, if addition to isMaster it can send command like isMasterStill, ohReally, areYouSureIsMaster, or  doubleCheckPlease :)


s.molinari

Apr 26, 2015, 2:23:25 PM
to mongod...@googlegroups.com
Thanks Asya. 

One last question, and it goes along the lines of your last footnote. Can't a client tell when it is connected to two primaries? I would think the client should then basically balk and hold up its service until one steps down.

Scott

Asya Kamsky

Apr 27, 2015, 1:36:16 AM
to mongodb-user
Scott,

In a distributed system there is no one single client.

Asya



s.molinari

Apr 27, 2015, 3:36:38 AM
to mongod...@googlegroups.com
Yes, I was thinking along some other lines (and bear with me, should my line of thinking be totally stupid and show my further ignorance). 

Why couldn't a client become the "master" of the system? If the client can directly tell that a new primary is trying to serve its queries, even with that first query, then it should go "Whoa, this is not good! I am now your master!" and basically completely stop all operations, everywhere, and like you said, start asking around, "Are we really ready to do business as usual?"

I am also thinking the clients would need to be able to talk to each other in some way and be aware of each other's presence for such a "master client" scenario to work properly when a network partition occurs. The first client to notice there is a new primary takes on the "master duties" and does any further coordination for the whole system. Once it has acquired these master duties, it is simply a race to shut down the other clients and the cluster nodes (from reading and writing), while at the same time health-checking the cluster for a proper single master.

My thinking is, the "dumb" clients are the main reason for the larger risk of an inconsistent system state. Give them some "smarts" to avoid this a bit better and at least we've reduced the risk of inconsistencies down to milliseconds. Yes, we've also got our users being denied service, but better denied service than a totally messed-up service. Theoretically, the system should only be "down" until the old primary steps down and all the clients have been updated to that fact. The stepping-down process could actually be sped up if the "master client" has access to the old primary too.

If both partitioned sides have viable masters, the smart client could also even force the complete shutdown of a whole partition. 

Of course, this state of recovery also means that the smart client becomes a single point of failure for everything. But we are down to milliseconds - a reasonable risk to take for better consistency?

If my thinking is correct (and I highly doubt it is), a smart client system could be really handy if the network partition split the clients too. Because the clients were aware of the other clients' presence, the two master clients would also know they cannot resume any operations until all the clients get back "together". If they do regain communications, the two master clients would have to coordinate to reinstate the best condition possible for the cluster. If they never do, then I would think manual intervention would be necessary to decide which partition goes back into operation.

Is this totally ridiculous? I mean, the problems of a distributed system come down to how well the participants can handle anomalies. Since MongoDB's clients are pretty much dumb to these things, isn't this the reason for too many possible inconsistencies?

Scott

s.molinari

Apr 27, 2015, 3:38:39 AM
to mongod...@googlegroups.com
Crap, I meant single primary in the third paragraph. ( I hate you that you can't edit in Google Groups!) U_U

Scott

s.molinari

Apr 27, 2015, 3:42:06 AM
to mongod...@googlegroups.com
Doh, I don't hate you! I hate Google Groups. LOL!

Scott

Asya Kamsky

Apr 27, 2015, 7:43:08 PM
to mongodb-user
Scott,

This is getting a little long and probably a little bit into the weeds, so this is my last comment on the thread, but it basically comes down to this:

You are trying to introduce a single coordinator into the distributed system.  If you have that, then you have a better chance of strong consistency but a much lower chance of availability.  A long time ago, my mom used to tell me that one person cannot be in two places at the same time (1), and so it goes with distributed systems. The same information cannot be transmitted instantaneously to all parties in a transaction ... unless you force every single transaction to go through a single point, in which case it will of course become a single point of failure.  :)

Asya

(1) it was a significantly less polite sounding expression :)


s.molinari

Apr 27, 2015, 11:02:29 PM
to mongod...@googlegroups.com
Thanks Asya. 

I do realize the idea of a client master would be a single point of failure. But, theoretically, the single client master would only matter in a situation where the whole system is stopped anyway (or at least should be), so it is already in a failed state. The risk would be leaving the future health of the system "in the hands" of that single master client. I'd rather have the system stopped in a fairly consistent state than have it go on for seconds in an ever more inconsistent state.

So, 'nough said. Thanks again for your time and patience. I learned quite a bit.

Scott