replication stuck?


jhoff

Jul 25, 2011, 6:26:53 PM
to mongodb-user
I'm seeing this message in the log of all of my secondaries:

Mon Jul 25 22:18:14 [replica set sync] replSet syncThread: 10154
Modifiers and non-modifiers cannot be mixed

And rs.status(); shows that all of them are way behind:

{
    "_id" : 4,
    "name" : "<internal-name-1>:27017",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "optime" : {
        "t" : 1311073175000,
        "i" : 22
    },
    "optimeDate" : ISODate("2011-07-19T10:59:35Z"),
    "self" : true
},
{
    "_id" : 5,
    "name" : "<internal-name-2>:27017",
    "health" : 1,
    "state" : 1,
    "stateStr" : "PRIMARY",
    "uptime" : 7442,
    "optime" : {
        "t" : 1311632318000,
        "i" : 480
    },
    "optimeDate" : ISODate("2011-07-25T22:18:38Z"),
    "lastHeartbeat" : ISODate("2011-07-25T22:24:08Z")
},

We had an issue where our keys contained dots, which we think caused this. It's been resolved (we removed the dots) and the primary appears to be fine, but replication isn't happening to the secondaries. How can I fix this?

Thanks,
Jerry
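
For reference, error 10154 comes from MongoDB's update validation: an update document must be either all $-modifiers or all literal fields, never a mix. A minimal Python sketch of that rule (a hypothetical validator for illustration, not MongoDB's actual code):

```python
def validate_update(update_doc):
    """Mimic MongoDB's rule behind error 10154: an update document must
    be either all $-modifiers or all plain fields, never a mix."""
    has_modifier = any(k.startswith("$") for k in update_doc)
    has_plain = any(not k.startswith("$") for k in update_doc)
    if has_modifier and has_plain:
        raise ValueError("10154 Modifiers and non-modifiers cannot be mixed")
    return True

validate_update({"$inc": {"counter": 1}})   # all modifiers: accepted
validate_update({"name": "x", "total": 3})  # full replacement: accepted
# validate_update({"name": "x", "$inc": {"n": 1}}) would raise ValueError
```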

Jerry Hoffmeister

Jul 25, 2011, 6:53:07 PM
to Jerry Hoffmeister, mongodb-user
Also, my monitoring for average flush time on the primary went from around 50ms to approximately 400ms, and the lag monitoring has of course alarmed as well.


Jerry Hoffmeister

Jul 25, 2011, 8:52:50 PM
to Jerry Hoffmeister, mongodb-user
We "fixed" this by stopping both secondaries, deleting the db files, and restarting them.

Curious how we got into this state, though.

Also, my comment about the dots in the keys may not be correct - I'm not sure this was the cause.


Scott Hernandez

Jul 25, 2011, 9:03:02 PM
to mongod...@googlegroups.com
Can you post the logs from one of the secondaries?

Also, what version are you running? (It's good to always include the server platform and version in every post.)


Jerry Hoffmeister

Jul 26, 2011, 4:36:36 PM
to mongod...@googlegroups.com
Hi Scott,

Thanks for the response and yes, in the future I'll include the version info. I'm running 1.8.1 on Ubuntu 11.04. I've attached the log from the secondary that has been up the longest.

And I assume you saw my followup posts that I was able to solve the issue by shutting down both secondaries, removing the data directories, and starting them back up. What was interesting was that I'd brought up one of the secondaries from scratch earlier in the day, and it saw the problem as well. It seems it may have tried to sync from the other (bad) secondary? I can provide the log from the other secondary, and of course the primary as well, if it would be helpful.

Also, I'm using https://github.com/mzupan/nagios-plugin-mongodb for monitoring and it's still showing high flush times for all three replicas as well as high replication lag on the primary (high meaning really high):

CRITICAL - Max replication lag: 639036 [10.2.31.239 lag=639036: 10.92.243.66 lag=639036: 10.2.31.239 lag=0: 10.122.161.138 lag=0: ]

The first and last IP above are my two secondaries. Not sure what the middle two are. I haven't had time to dig into it much yet. Oh and the replication lag number just keeps climbing. But rs.status(); shows that all replicas are in sync now:

OggiLogging:PRIMARY> rs.status();
{
    "set" : "OggiLogging",
    "date" : ISODate("2011-07-26T20:39:46Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 3,
            "name" : "domU-12-31-39-09-98-99.compute-1.internal:27017",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 74518,
            "optime" : {
                "t" : 0,
                "i" : 0
            },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2011-07-26T20:39:45Z")
        },
        {
            "_id" : 4,
            "name" : "ip-10-2-31-239.ec2.internal:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 74425,
            "optime" : {
                "t" : 1311712785000,
                "i" : 912
            },
            "optimeDate" : ISODate("2011-07-26T20:39:45Z"),
            "lastHeartbeat" : ISODate("2011-07-26T20:39:45Z")
        },
        {
            "_id" : 5,
            "name" : "domU-12-31-39-01-62-12.compute-1.internal:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : {
                "t" : 1311712786000,
                "i" : 326
            },
            "optimeDate" : ISODate("2011-07-26T20:39:46Z"),
            "self" : true
        },
        {
            "_id" : 7,
            "name" : "ip-10-122-161-138.ec2.internal:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 73273,
            "optime" : {
                "t" : 1311712785000,
                "i" : 566
            },
            "optimeDate" : ISODate("2011-07-26T20:39:45Z"),
            "lastHeartbeat" : ISODate("2011-07-26T20:39:45Z")
        }
    ],
    "ok" : 1
}


(attachment: mongodb.zip)

Kristina Chodorow

Jul 26, 2011, 5:24:35 PM
to mongodb-user
The primary can't have replication lag, that's only for secondaries.
Do you know what Nagios is measuring for that? rs.status() has the
definitive info about lag.

I didn't see any of the sync errors in the log attached. Sounds like the dots in key names were probably the culprit, but I can't tell for sure from the log.

Secondaries will have a line, "syncing to: <host>" if you want to see
who they chose for their initial sync.



Jerry Hoffmeister

Jul 26, 2011, 5:38:00 PM
to mongod...@googlegroups.com, Paul Grinchenko
The errors are definitely in the log I sent - I even downloaded it from the e-mail again to be sure. Here's the first occurrence:

Tue Jul 19 10:56:16 [replica set sync] replSet syncThread: 10154 Modifiers and non-modifiers cannot be mixed

And it did sync from one of the secondaries - in fact, the one I replaced it with:

Mon Jul 25 20:20:32 [replica set sync] replSet syncing to: ip-10-92-243-66.ec2.internal:27017

(You'll see it's not listed in the rs.status() below.)

Interesting, the Nagios plugin ONLY reports replication lag for the primary - for the secondaries, the line is:

"OK - This is a slave."

rs.status() definitely shows that they are in sync... I will write to the plugin author and see if maybe the plugin is out of date...

Kristina Chodorow

Jul 26, 2011, 5:47:09 PM
to mongodb-user
Oh, I see, I was looking for the wrong error message. Sorry, long
day. Do you have any copy of one of the "bad" docs that was inserted?




Jerry Hoffmeister

Jul 26, 2011, 6:12:52 PM
to mongod...@googlegroups.com
I asked and we don't. What I'm told is that inserts were failing for docs that had keys with dots in their names. We fixed the code doing the inserts to not include dots in key names. We're using the Java driver.

I don't know for sure that this was the cause of replication getting stuck - just that the same error message was being generated. Would the primary's log help?
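
Since MongoDB forbids '.' in stored key names, a fix like the one described can be sketched as a recursive key rewrite before insert. This is an illustrative Python sketch only (the actual fix was in their Java code, and the underscore replacement is an arbitrary choice):

```python
def strip_dots(value, replacement="_"):
    """Recursively rewrite dictionary keys so they contain no dots,
    since MongoDB rejects '.' in stored key names."""
    if isinstance(value, dict):
        return {k.replace(".", replacement): strip_dots(v, replacement)
                for k, v in value.items()}
    if isinstance(value, list):
        return [strip_dots(v, replacement) for v in value]
    return value

print(strip_dots({"some.key": 10, "nested": {"a.b": [1, {"c.d": 2}]}}))
# {'some_key': 10, 'nested': {'a_b': [1, {'c_d': 2}]}}
```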


Kristina Chodorow

Jul 26, 2011, 6:25:10 PM
to mongodb-user
No, that's fine, I can try to reproduce from that.

It would definitely halt replication. If there's any sort of error, the secondary will print "syncThread: <errorMsg>" and then retry syncing that operation ad infinitum.



Paul Grinchenko

Jul 26, 2011, 6:44:09 PM
to mongodb-user
Hi Kristina,

Thank you for taking time to answer our questions.

Am I right in understanding the flow:

T1: primary receives a request for db.insert({ "some.key" : 10 }) - it was actually an upsert
T2: primary saves the operation to the oplog
T3: primary tries to execute the request and fails

T4: secondary reads the oplog entry with this upsert
T5: secondary fails to insert and gets stuck

It seems that T2 should be conditional on T3 succeeding.

Would appreciate your comments,
Paul

Kristina Chodorow

Jul 26, 2011, 7:29:32 PM
to mongodb-user
It's actually:

T1: primary receives a request for db.insert({ "some.key" : 10 }) - I'm guessing it was an update, not an upsert?
T2: primary tries to execute the request and succeeds
T3: primary rewrites the update as an idempotent operation and stores it in the oplog

T4: secondary reads the oplog entry with this update
T5: secondary fails to execute the modified update and is stuck

Paul Grinchenko

Jul 26, 2011, 7:55:40 PM
to mongodb-user
I am trying to get more details:

1) It was the Java driver
2) You are right about update - it was something like:

db.update({ key1: value1, key2: value2 }, { $inc : { "." : 20 } }, true, false)

or

db.update({ key1: value1, key2: value2 }, { $inc : { "some.field" : 20 } }, true, false)

It was an "upsert" in the sense that the { key1, key2 } record probably didn't exist.

Jerry Hoffmeister

Jul 26, 2011, 7:54:37 PM
to mongod...@googlegroups.com
I believe I see the problem with the replication lag monitor. Here's the code that's executed on the nagios server for that check:

def check_rep_lag(host, port, warning, critical):
    try:
        con = pymongo.Connection(host, port, slave_okay=True)

        isMasterStatus = con.admin.command("ismaster", "1")
        if not isMasterStatus['ismaster']:
            print "OK - This is a slave."
            sys.exit(0)

        masterOpLog = con.local['oplog.rs']
        lastMasterOpTime = masterOpLog.find_one(sort=[('$natural', -1)])['ts'].time
        slaves = con.local.slaves.find()
        data = ";"
        lag = 0
        for slave in slaves:
            lastSlaveOpTime = slave['syncedTo'].time
            replicationLag = lastMasterOpTime - lastSlaveOpTime
            data = data + slave["host"] + " lag=" + str(replicationLag) + "; "
            lag = max(lag, replicationLag)
        data = data[1:len(data)]
        if lag >= critical:
            print "CRITICAL - Max replication lag: %i [%s]" % (lag, data)
            sys.exit(2)
        elif lag >= warning:
            print "WARNING - Max replication lag: %i [%s]" % (lag, data)
            sys.exit(1)
        else:
            print "OK - Max replication lag: %i [%s]" % (lag, data)
            sys.exit(0)

    except pymongo.errors.ConnectionFailure:
        print "CRITICAL - Connection to MongoDB failed!"
        sys.exit(2)

And here's the output of db.slaves.find() on the primary:

OggiLogging:PRIMARY> use local
switched to db local
OggiLogging:PRIMARY> db.slaves.find();
{ "_id" : ObjectId("4e03c9a83cb4be78cc097df5"), "host" : "10.2.31.239", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1311073175000, "i" : 22 } }
{ "_id" : ObjectId("4e0e4a472f3f0ae2e24c5ff8"), "host" : "10.92.243.66", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1311073175000, "i" : 17 } }
{ "_id" : ObjectId("4e2e0280fe83fddb100dcf6f"), "host" : "10.2.31.239", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1311724093000, "i" : 535 } }
{ "_id" : ObjectId("4e2e070031dc52ae9a3a70bf"), "host" : "10.122.161.138", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1311724093000, "i" : 535 } }
OggiLogging:PRIMARY>

The first and third entries are the same IP and the second entry is for an old host. Do I have some stale data in the local db?
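
The arithmetic behind the false alarm is easy to reproduce: the plugin takes the worst lag over every row in local.slaves, so one stale row pins the alert even when the live members are caught up. A pure-Python sketch using the syncedTo times from the db.slaves.find() output above (seconds only, ignoring the "i" increment):

```python
def max_replication_lag(master_optime, slave_docs):
    """Mirror the plugin's check: lag for each local.slaves row is the
    master optime minus that row's syncedTo, and the alert is driven by
    the worst (max) value - so a stale row inflates it indefinitely."""
    lag = 0
    for slave in slave_docs:
        lag = max(lag, master_optime - slave["syncedTo"])
    return lag

# syncedTo values from the db.slaves.find() output above (t / 1000)
slaves = [
    {"host": "10.2.31.239",    "syncedTo": 1311073175},  # stale entry
    {"host": "10.92.243.66",   "syncedTo": 1311073175},  # old, removed host
    {"host": "10.2.31.239",    "syncedTo": 1311724093},  # current
    {"host": "10.122.161.138", "syncedTo": 1311724093},  # current
]
print(max_replication_lag(1311724093, slaves))  # the stale rows dominate
```

Deleting the stale rows (or keying the check on rs.status() instead) brings the reported maximum back to the real lag of the live members.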


Kristina Chodorow

Jul 26, 2011, 8:11:21 PM
to mongodb-user
@Paul: thanks, that's very helpful!

@Jerry: local.slaves is sort of a pale echo of what's going on with
replication. It's never cleaned out and won't be updated if, for
example, one secondary is syncing off of another secondary.

The best way of finding out how up-to-date everyone is, is either rs.status() or finding the last entry in local.oplog.rs for each member and comparing its ts to the ts on the primary.

Jerry Hoffmeister

Jul 26, 2011, 8:12:44 PM
to mongod...@googlegroups.com, Paul Grinchenko
Thanks Kristina - I think I will rewrite the monitoring plugin to use rs.status(). Until I do that, is there any issue with deleting those first two records in local.slaves?


Kristina Chodorow

Jul 26, 2011, 8:20:39 PM
to mongodb-user
Go ahead and delete any entries you want from local.slaves. They'll
be recreated as servers sync.



Jerry Hoffmeister

Jul 26, 2011, 8:35:46 PM
to mongod...@googlegroups.com
Thanks - that did the trick!


Kristina Chodorow

Jul 27, 2011, 10:11:55 AM
to mongodb-user
Cool, I've created a bug for the sync problem, if you want to track:
https://jira.mongodb.org/browse/SERVER-3494.



Paul Grinchenko

Jul 27, 2011, 12:19:00 PM
to mongodb-user
Thank you, Kristina. Your help is highly appreciated.

1) Will I be able to post comments on that JIRA?
2) It's good that we had a small DB (around 2G), so a full re-sync of the secondaries wasn't a major event, but I can imagine that at around 500G this would become a real problem.

Kristina Chodorow

Jul 27, 2011, 12:27:40 PM
to mongodb-user
1) Yes, feel free.
2) There are ways to hack around this (so you don't have to resync),
but it is a pain.



Mark Kwan

Jul 27, 2011, 3:22:24 PM
to mongodb-user
I ran into this problem when migrating to 1.8. I had a 1.6.5 master
and a 1.8.2 slave running for several days. When I went to switch
everything over a few days later I found the 1.8.2's had stopped
sync'ing because they were too far behind. When I looked back in the
logs I saw the problem seemed to start with the same "replSet
syncThread: 10154 Modifiers and non-modifiers cannot be mixed" error
messages reported above. Had to resync the 400GB database.

At the time I figured it was just a 1.6.5 master / 1.8.2 slave problem and that it wouldn't be an issue once I migrated everything to 1.8.2. I never had a sync problem with 1.6.5. I guess I'd better go add some Nagios checks for 1.8.2 now.

jhoff

Aug 2, 2011, 1:33:33 PM
to mongodb-user
The issue is happening again - do you want to see logs? I don't see anything that jumps out... Should we re-sync again, or try the painful hack-around you mentioned that doesn't require a re-sync? Which would you suggest, and can you detail the "hack around" steps?

Kristina Chodorow

Aug 2, 2011, 3:43:48 PM
to mongodb-user
To hack around:

1. Connect to secondary.
2. Run:

> use local
> var last = db.oplog.rs.find().sort({$natural:-1}).limit(1)

// remember this result

3. Connect to primary.
4. Run:

> use local
> // this may take a while to run
> problemDoc = db.oplog.rs.find({ts : {$gt : last.ts}}).sort({$natural:1}).limit(1)
// take this result and send it.


5. Basically, you're going to null out this operation on the primary
so that the secondary can skip it, then do another operation on the
primary so that the secondary has the same value. However, I can't
tell exactly what you have to do without knowing the operation. But,
if you just want to get replication going again and are okay with the
secondary having 1 doc with different data for a little while, run:

> // this may take a while to run
> db.oplog.rs.update({ts : problemDoc.ts}, {$set : {op : 'n'}});

Jerry Hoffmeister

Aug 2, 2011, 6:00:00 PM
to mongod...@googlegroups.com, Paul Grinchenko
Unfortunately we already just resync'd, but next time I will definitely try this. I was thinking something along these lines but wasn't sure how to do it.


Kristina Chodorow

Aug 2, 2011, 6:10:17 PM
to mongodb-user
Please file a bug with the output of step 4 if this happens again, as that will help us debug the root cause.



Jerry Hoffmeister

Aug 2, 2011, 6:41:44 PM
to mongod...@googlegroups.com
Would it be better to just add it to the existing JIRA (https://jira.mongodb.org/browse/SERVER-3494)?


Kristina Chodorow

Aug 2, 2011, 9:51:22 PM
to mongodb-user
Yes, that would be best.


jhoff

Sep 16, 2011, 3:24:25 PM
to mongodb-user
We had this problem again last night. I tried the above "hack around"
but problemDoc came back empty...


jhoff

Sep 19, 2011, 3:51:08 PM
to mongodb-user
We have this issue again. I again tried the "hack around" suggested above, but problemDoc was again empty...

I would like to wait, but we need to get our db back up and running, so I'm going to do the re-sync.

If there is something else I could do next time, please let me know here.

Kristina Chodorow

Sep 19, 2011, 4:12:39 PM
to mongodb-user
Looking back on what I put before, it's actually:

var last = db.oplog.rs.find().sort({$natural:-1}).limit(1).next()
...
problemDoc = db.oplog.rs.find({ts : {$gt : last.ts}}).sort({$natural:1}).limit(1).next()

Otherwise you're dealing with a cursor, not a single document.