replset.minvalid

Gambitg

May 20, 2011, 2:10:00 PM
to mongodb-user
The documentation says:
"local.replset.minvalid sometimes contains an object used internally
by replica sets to track sync status"

Why is it only 'sometimes'? Isn't this the collection that a
secondary reads in order to resume reading the primary's oplog at the
right place?


On a separate note, does the secondary write first to its oplog or
first to the dataset?

k

May 20, 2011, 7:29:36 PM
to mongodb-user
The minvalid doc is used for initial syncs, so servers that have never
done an initial sync, and arbiters, don't have a minvalid. Once a
server is running, the latest entry in its oplog (oplog.rs) is where
it's synced to, so it doesn't need the minvalid to keep track.

> On a separate note, does the secondary first write to its opLog or
> first to the dataset ?

I assume you mean on initial sync? The secondary does the following
steps:
1) It allocates an oplog, but doesn't put anything in it.
2) It takes note of the primary's latest op and saves it in
minvalid.
3) It copies over all of the data from whoever it's syncing from.
4) Then it starts applying the oplog from the minvalid value onward.
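
The four steps above can be sketched as a toy model in plain JavaScript. This is only an illustration, not MongoDB code: `initialSync`, `source`, `writesDuringCopy` and the single-integer `ts` values are all hypothetical names and simplifications. It shows why saving minvalid before the copy matters: ops that land on the source while the data is being copied are still picked up in step 4.

```javascript
// Toy model of the initial-sync steps described above (hypothetical names,
// integer timestamps; not the real server implementation).
function initialSync(source, writesDuringCopy = () => {}) {
  // 1) Allocate an oplog, but don't put anything in it.
  const member = { oplog: [], minvalid: null, data: new Map() };

  // 2) Note the source's latest op and save it as minvalid.
  member.minvalid = source.oplog[source.oplog.length - 1].ts;

  // 3) Copy over all of the data from the sync source.
  for (const [id, doc] of source.data) {
    member.data.set(id, { ...doc });
  }
  // Simulate writes that hit the source while the copy was running.
  writesDuringCopy(source);

  // 4) Apply the source's oplog from minvalid onward (inserts are idempotent here).
  for (const op of source.oplog) {
    if (op.ts >= member.minvalid) {
      member.data.set(op.doc._id, { ...op.doc });
      member.oplog.push(op);
    }
  }
  return member;
}

const source = {
  data: new Map([[1, { _id: 1, x: 1 }]]),
  oplog: [{ ts: 100, doc: { _id: 1, x: 1 } }],
};
const synced = initialSync(source, (s) => {
  s.data.set(2, { _id: 2, x: 2 });
  s.oplog.push({ ts: 101, doc: { _id: 2, x: 2 } });
});
console.log(synced.minvalid, [...synced.data.keys()]);
```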

Gambitg

May 23, 2011, 11:22:51 AM
to mongodb-user
> Once a server is running, the latest entry in its oplog (oplog.rs) is where
> its synced to, so it doesn't need the minvalid to keep track.

I was under the impression that the "ts" field in a secondary's
oplog.rs contains the timestamp of when the record was written to it.
But after running some tests on a cluster (with different clocks on
different nodes), I see that ts in oplog.rs on the secondary contains the
ts value copied over from the oplog.rs of the primary.
Now that you mention that the latest entry in the secondary's oplog is
used to sync, it makes sense.

But I think we might run into a data-loss scenario. Say we have 3
nodes (A, B and C), and the 2 secondaries (B and C) have clocks that are
behind the primary (A) by, say, 10 minutes. If A goes down at time t (on its
own clock), one of the secondaries (say B) will become the primary and start
writing to its oplog using its own clock, which is 10 minutes behind. So
records in its oplog will not be in increasing timestamp order. Node C
will start reading from the oplog of B, but it will try to pick up
records with ts >= the latest ts in its own oplog. So records that get
written to B in roughly the first 10 minutes of it being primary will not
be picked up by C. Right?
C will keep looking for records with ts >= t, and new records on B with ts
between t-10 minutes and t will never be fetched by C.
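
To make the concern concrete, here is a small simulation in plain JavaScript. It assumes, as described above, that a secondary only asks its sync source for ops with ts >= the newest ts it already holds. `fetchNewOps`, the `op` labels and the specific offsets are made up for illustration.

```javascript
// Hypothetical model of a secondary's catch-up query: fetch only ops whose
// ts is >= the newest ts the secondary already has.
function fetchNewOps(sourceOplog, lastAppliedTs) {
  return sourceOplog.filter((entry) => entry.ts >= lastAppliedTs);
}

const MIN = 60 * 1000;           // one minute in milliseconds
const t = 1306180751000;         // newest ts on C, originally written by A

// B's oplog after failover, in insertion order. B's clock is ~10 minutes slow.
const oplogOnB = [
  { ts: t, op: "x:2" },            // replicated from A before it went down
  { ts: t - 9 * MIN, op: "y:1" },  // written by B shortly after becoming primary
  { ts: t - 8 * MIN, op: "y:2" },
  { ts: t + 5 * MIN, op: "y:3" },  // written once B's clock has passed t
];

const fetched = fetchNewOps(oplogOnB, t).map((entry) => entry.op);
console.log(fetched); // y:1 and y:2 are never returned to C
```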

k

May 23, 2011, 1:46:26 PM
to mongodb-user
Replica sets have some basic protection against clock skew in the
code. We once had a user with 30-year clock skew, which caused some
weird effects, but as long as your clocks are reasonably in sync,
MongoDB keeps track of the members' skew and should compensate.
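
For illustration only, one way such a guard could work is to never hand out a timestamp at or below the last one already in the oplog. This is a sketch of the general idea, not a description of what the server actually does; `makeTsGenerator` and the injectable `clock` parameter are hypothetical.

```javascript
// Hypothetical guard against a slow clock: timestamps are { t, i } pairs,
// and a new one is always strictly greater than the last one issued.
function makeTsGenerator(lastSeen, clock = Date.now) {
  let last = { ...lastSeen };
  return function next() {
    const now = clock();
    if (now > last.t) {
      // Clock is ahead of the last op: use it and reset the counter.
      last = { t: now, i: 1 };
    } else {
      // Clock is behind (or tied): keep t and bump the counter instead.
      last = { t: last.t, i: last.i + 1 };
    }
    return last;
  };
}

// Seed with the newest oplog ts, then simulate a clock that is ~5 minutes behind it.
const nextTs = makeTsGenerator({ t: 1306180751000, i: 1 }, () => 1306180451000);
const first = nextTs();
const second = nextTs();
console.log(first, second);
```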

Gambitg

May 23, 2011, 4:22:16 PM
to mongodb-user
I was able to easily set up and run a scenario where a reasonable
clock skew (a 10-minute difference) was not compensated for (i.e. data loss
occurs):

1. Set up nodes A, B and C. B and C are each about 10 minutes
behind A.
// On A
>date returns Mon May 23 19:54:34 UTC 2011
// On B
>date returns Mon May 23 19:43:33 UTC 2011

2. Cluster runs with A as primary. Insert a few records as follows:
db.foo.insert({x:1})
db.foo.insert({x:2})
// On A
rs1:PRIMARY> db.foo.find()
{ "_id" : ObjectId("4ddabc8b3fa543a5e0e33f55"), "x" : 1 }
{ "_id" : ObjectId("4ddabc8f3fa543a5e0e33f56"), "x" : 2 }
rs1:PRIMARY> use local
switched to db local
rs1:PRIMARY> db.oplog.rs.find()
{ "ts" : { "t" : 1306180706000, "i" : 1 }, "h" : NumberLong(0), "op" :
"n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1306180747000, "i" : 1 }, "h" :
NumberLong("3136340678959693841"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabc8b3fa543a5e0e33f55"), "x" : 1 } }
{ "ts" : { "t" : 1306180751000, "i" : 1 }, "h" :
NumberLong("-3548970921664408588"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabc8f3fa543a5e0e33f56"), "x" : 2 } }

B and C return exactly the same output for both the foo and oplog.rs
collections.

3. Now terminate node A using db.shutdownServer(). B gets promoted to
primary. Insert a few more records (on B):
db.foo.insert({y:1})
db.foo.insert({y:2})

// On B
rs1:PRIMARY> db.foo.find()
{ "_id" : ObjectId("4ddabc8b3fa543a5e0e33f55"), "x" : 1 }
{ "_id" : ObjectId("4ddabc8f3fa543a5e0e33f56"), "x" : 2 }
{ "_id" : ObjectId("4ddabb630cc90e0d72223b68"), "y" : 1 }
{ "_id" : ObjectId("4ddabb680cc90e0d72223b69"), "y" : 2 }
rs1:PRIMARY> use local
switched to db local
rs1:PRIMARY> db.oplog.rs.find()
{ "ts" : { "t" : 1306180706000, "i" : 1 }, "h" : NumberLong(0), "op" :
"n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1306180747000, "i" : 1 }, "h" :
NumberLong("3136340678959693841"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabc8b3fa543a5e0e33f55"), "x" : 1 } }
{ "ts" : { "t" : 1306180751000, "i" : 1 }, "h" :
NumberLong("-3548970921664408588"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabc8f3fa543a5e0e33f56"), "x" : 2 } }
{ "ts" : { "t" : 1306180451000, "i" : 1 }, "h" :
NumberLong("-5215459932265573458"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabb630cc90e0d72223b68"), "y" : 1 } }
{ "ts" : { "t" : 1306180456000, "i" : 1 }, "h" :
NumberLong("-8690927360367643972"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabb680cc90e0d72223b69"), "y" : 2 } }

You can see that the ts values are no longer in increasing order. The ts
of the {y:1} insert is less than that of the {x:2} insert.
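
A quick way to spot this kind of regression is to walk the oplog entries in insertion order and flag any ts that does not increase. The sketch below is plain JavaScript using the ts values from B's oplog above; `compareTs` and `outOfOrder` are hypothetical helpers. In the shell, the entries could instead come from db.oplog.rs.find().toArray().

```javascript
// Compare two { t, i } timestamps as printed by the shell: by t, then by i.
function compareTs(a, b) {
  return a.t !== b.t ? a.t - b.t : a.i - b.i;
}

// Return the indexes of entries whose ts is not greater than the previous one.
function outOfOrder(entries) {
  const bad = [];
  for (let k = 1; k < entries.length; k++) {
    if (compareTs(entries[k].ts, entries[k - 1].ts) <= 0) {
      bad.push(k);
    }
  }
  return bad;
}

// ts values copied from B's oplog in step 3, in insertion order.
const oplogB = [
  { ts: { t: 1306180706000, i: 1 } }, // initiating set
  { ts: { t: 1306180747000, i: 1 } }, // insert x:1
  { ts: { t: 1306180751000, i: 1 } }, // insert x:2
  { ts: { t: 1306180451000, i: 1 } }, // insert y:1 (B's slow clock)
  { ts: { t: 1306180456000, i: 1 } }, // insert y:2
];
const regressions = outOfOrder(oplogB);
console.log(regressions);
```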

C does not register these new inserts at all.
// On C
rs1:SECONDARY> db.foo.find()
{ "_id" : ObjectId("4ddabc8b3fa543a5e0e33f55"), "x" : 1 }
{ "_id" : ObjectId("4ddabc8f3fa543a5e0e33f56"), "x" : 2 }
rs1:SECONDARY> db.oplog.rs.find()
{ "ts" : { "t" : 1306180706000, "i" : 1 }, "h" : NumberLong(0), "op" :
"n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1306180747000, "i" : 1 }, "h" :
NumberLong("3136340678959693841"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabc8b3fa543a5e0e33f55"), "x" : 1 } }
{ "ts" : { "t" : 1306180751000, "i" : 1 }, "h" :
NumberLong("-3548970921664408588"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabc8f3fa543a5e0e33f56"), "x" : 2 } }
Neither collection changes on C, even though writes are being
made on the PRIMARY (B) and C is in SECONDARY state.

4. Wait for more than 10 minutes, then insert one more record on B:
db.foo.insert({y:3})

rs1:PRIMARY> db.oplog.rs.find()
{ "ts" : { "t" : 1306176176000, "i" : 1 }, "h" : NumberLong(0), "op" :
"n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1306176382000, "i" : 1 }, "h" :
NumberLong("3136021970911494161"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddaab7e77a574b12fbed4f7"), "x" : 1 } }
{ "ts" : { "t" : 1306176387000, "i" : 1 }, "h" :
NumberLong("-4259052380038851596"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddaab8377a574b12fbed4f8"), "x" : 2 } }
{ "ts" : { "t" : 1306175929000, "i" : 1 }, "h" :
NumberLong("-147207564444626001"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddaa9b957bfea93870c5782"), "y" : 1 } }
{ "ts" : { "t" : 1306175935000, "i" : 1 }, "h" :
NumberLong("7346136642044836720"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddaa9bf57bfea93870c5783"), "y" : 2 } }
{ "ts" : { "t" : 1306177210000, "i" : 1 }, "h" :
NumberLong("720390880350244707"), "op" : "i", "ns" : "test.foo", "o" :
{ "_id" : ObjectId("4ddaaeba57bfea93870c5784"), "y" : 3 } }

// On C
rs1:SECONDARY> db.foo.find()
{ "_id" : ObjectId("4ddabc8b3fa543a5e0e33f55"), "x" : 1 }
{ "_id" : ObjectId("4ddabc8f3fa543a5e0e33f56"), "x" : 2 }
{ "_id" : ObjectId("4ddabdd50cc90e0d72223b6a"), "y" : 3 }
rs1:SECONDARY> use local
switched to db local
rs1:SECONDARY> db.oplog.rs.find()
{ "ts" : { "t" : 1306180706000, "i" : 1 }, "h" : NumberLong(0), "op" :
"n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1306180747000, "i" : 1 }, "h" :
NumberLong("3136340678959693841"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabc8b3fa543a5e0e33f55"), "x" : 1 } }
{ "ts" : { "t" : 1306180751000, "i" : 1 }, "h" :
NumberLong("-3548970921664408588"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabc8f3fa543a5e0e33f56"), "x" : 2 } }
{ "ts" : { "t" : 1306181077000, "i" : 1 }, "h" :
NumberLong("-924333443697256058"), "op" : "i", "ns" : "test.foo",
"o" : { "_id" : ObjectId("4ddabdd50cc90e0d72223b6a"), "y" : 3 } }
rs1:SECONDARY>

You can see that C is missing the records {y:1} and {y:2}, but {y:3},
which was inserted on B about 10 minutes after it became primary,
made it to C.

k

May 23, 2011, 4:31:56 PM
to mongodb-user
I've created a bug for this: https://jira.mongodb.org/browse/SERVER-3132

It definitely should handle that type of skew!

Gambitg

May 26, 2011, 3:27:18 PM
to mongodb-user
I'm surprised that no one has run into this so far.
The 1.4 release notes mention 'replication handles clock skew on
master'. I would have guessed a scenario like this would have been
covered.