________________________________________
From: jhoff [je...@hoffmeisters.com]
Sent: Monday, July 25, 2011 3:26 PM
To: mongodb-user
Subject: replication stuck?
Curious how we got into this state, though.
And my comment about the dots in the keys may not be correct; I'm not sure this was the cause.
________________________________________
From: Jerry Hoffmeister
Sent: Monday, July 25, 2011 3:53 PM
To: Jerry Hoffmeister; mongodb-user
Subject: RE: replication stuck?
Also, what version are you running? (It's good to always include the server platform and version in every post.)
Thanks for the response and yes, in the future I'll include the version info. I'm running 1.8.1 on Ubuntu 11.04. I've attached the log from the secondary that has been up the longest.
And I assume you saw my follow-up posts saying that I was able to solve the issue by shutting down both secondaries, removing the data directories and starting them back up. What was interesting was that one of the secondaries had been brought up from scratch earlier in the day, and it hit the problem as well. Maybe it tried to sync from the other (bad) secondary? I can provide the log from the other secondary, and of course the primary as well, if it would be helpful.
Also, I'm using https://github.com/mzupan/nagios-plugin-mongodb for monitoring and it's still showing high flush times for all three replicas as well as high replication lag on the primary (high meaning really high):
CRITICAL - Max replication lag: 639036 [10.2.31.239 lag=639036: 10.92.243.66 lag=639036: 10.2.31.239 lag=0: 10.122.161.138 lag=0: ]
The first and last IP above are my two secondaries. Not sure what the middle two are. I haven't had time to dig into it much yet. Oh and the replication lag number just keeps climbing. But rs.status(); shows that all replicas are in sync now:
OggiLogging:PRIMARY> rs.status();
{
    "set" : "OggiLogging",
    "date" : ISODate("2011-07-26T20:39:46Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 3,
            "name" : "domU-12-31-39-09-98-99.compute-1.internal:27017",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 74518,
            "optime" : {
                "t" : 0,
                "i" : 0
            },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2011-07-26T20:39:45Z")
        },
        {
            "_id" : 4,
            "name" : "ip-10-2-31-239.ec2.internal:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 74425,
            "optime" : {
                "t" : 1311712785000,
                "i" : 912
            },
            "optimeDate" : ISODate("2011-07-26T20:39:45Z"),
            "lastHeartbeat" : ISODate("2011-07-26T20:39:45Z")
        },
        {
            "_id" : 5,
            "name" : "domU-12-31-39-01-62-12.compute-1.internal:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : {
                "t" : 1311712786000,
                "i" : 326
            },
            "optimeDate" : ISODate("2011-07-26T20:39:46Z"),
            "self" : true
        },
        {
            "_id" : 7,
            "name" : "ip-10-122-161-138.ec2.internal:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 73273,
            "optime" : {
                "t" : 1311712785000,
                "i" : 566
            },
            "optimeDate" : ISODate("2011-07-26T20:39:45Z"),
            "lastHeartbeat" : ISODate("2011-07-26T20:39:45Z")
        }
    ],
    "ok" : 1
}
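As a cross-check on the "in sync" reading, the lag implied by this rs.status() output can be worked out from the optimes alone. The following is only a sketch: lag_from_status is a hypothetical helper (not part of MongoDB or the Nagios plugin), and the dict is a trimmed hand copy of the output above, with "t" in milliseconds as the shell prints it.

```python
# Trimmed hand copy of the rs.status() output above (only the fields used here).
status = {
    "members": [
        {"name": "domU-12-31-39-09-98-99.compute-1.internal:27017",
         "stateStr": "ARBITER", "optime": {"t": 0, "i": 0}},
        {"name": "ip-10-2-31-239.ec2.internal:27017",
         "stateStr": "SECONDARY", "optime": {"t": 1311712785000, "i": 912}},
        {"name": "domU-12-31-39-01-62-12.compute-1.internal:27017",
         "stateStr": "PRIMARY", "optime": {"t": 1311712786000, "i": 326}},
        {"name": "ip-10-122-161-138.ec2.internal:27017",
         "stateStr": "SECONDARY", "optime": {"t": 1311712785000, "i": 566}},
    ]
}


def lag_from_status(status):
    """Hypothetical helper: seconds each SECONDARY is behind the PRIMARY."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    primary_secs = primary["optime"]["t"] // 1000  # shell prints milliseconds
    lags = {}
    for member in members:
        # Arbiters hold no data (optime 0), so only secondaries are compared.
        if member["stateStr"] != "SECONDARY":
            continue
        lags[member["name"]] = primary_secs - member["optime"]["t"] // 1000
    return lags


lags = lag_from_status(status)
for name, secs in sorted(lags.items()):
    print("%s lag=%d" % (name, secs))
```

With the values above, both secondaries come out one second behind the primary, which is nowhere near the figure the plugin reports.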
________________________________________
From: mongod...@googlegroups.com [mongod...@googlegroups.com] On Behalf Of Scott Hernandez [scotthe...@gmail.com]
Sent: Monday, July 25, 2011 6:03 PM
To: mongod...@googlegroups.com
Subject: Re: [mongodb-user] replication stuck?
Tue Jul 19 10:56:16 [replica set sync] replSet syncThread: 10154 Modifiers and non-modifiers cannot be mixed
And it did sync from one of the secondaries - in fact the one I replaced it with:
Mon Jul 25 20:20:32 [replica set sync] replSet syncing to: ip-10-92-243-66.ec2.internal:27017 (you'll see it's not listed in the rs.status() below).
Interesting, the Nagios plugin ONLY reports replication lag for the primary - for the secondaries, the line is:
"OK - This is a slave."
rs.status() definitely shows that they are in sync... I will write to the plugin author and see if maybe the plugin is out of date...
________________________________________
From: mongod...@googlegroups.com [mongod...@googlegroups.com] On Behalf Of Kristina Chodorow [k.cho...@gmail.com]
Sent: Tuesday, July 26, 2011 2:24 PM
To: mongodb-user
Subject: [mongodb-user] Re: replication stuck?
I don't know for sure that this was the cause of the replication getting stuck - just that the same error message was being generated. Would the primary's log help?
________________________________________
From: mongod...@googlegroups.com [mongod...@googlegroups.com] On Behalf Of Kristina Chodorow [k.cho...@gmail.com]
Sent: Tuesday, July 26, 2011 2:47 PM
def check_rep_lag(host, port, warning, critical):
    try:
        con = pymongo.Connection(host, port, slave_okay=True)
        isMasterStatus = con.admin.command("ismaster", "1")
        if not isMasterStatus['ismaster']:
            print "OK - This is a slave."
            sys.exit(0)
        masterOpLog = con.local['oplog.rs']
        lastMasterOpTime = masterOpLog.find_one(sort=[('$natural', -1)])['ts'].time
        slaves = con.local.slaves.find()
        data = ";"
        lag = 0
        for slave in slaves:
            lastSlaveOpTime = slave['syncedTo'].time
            replicationLag = lastMasterOpTime - lastSlaveOpTime
            data = data + slave["host"] + " lag=" + str(replicationLag) + "; "
            lag = max(lag, replicationLag)
        data = data[1:len(data)]
        if lag >= critical:
            print "CRITICAL - Max replication lag: %i [%s]" % (lag, data)
            sys.exit(2)
        elif lag >= warning:
            print "WARNING - Max replication lag: %i [%s]" % (lag, data)
            sys.exit(1)
        else:
            print "OK - Max replication lag: %i [%s]" % (lag, data)
            sys.exit(0)
    except pymongo.errors.ConnectionFailure:
        print "CRITICAL - Connection to MongoDB failed!"
        sys.exit(2)
And here's the output of db.slaves.find() on the primary:
OggiLogging:PRIMARY> use local
switched to db local
OggiLogging:PRIMARY> db.slaves.find();
{ "_id" : ObjectId("4e03c9a83cb4be78cc097df5"), "host" : "10.2.31.239", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1311073175000, "i" : 22 } }
{ "_id" : ObjectId("4e0e4a472f3f0ae2e24c5ff8"), "host" : "10.92.243.66", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1311073175000, "i" : 17 } }
{ "_id" : ObjectId("4e2e0280fe83fddb100dcf6f"), "host" : "10.2.31.239", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1311724093000, "i" : 535 } }
{ "_id" : ObjectId("4e2e070031dc52ae9a3a70bf"), "host" : "10.122.161.138", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1311724093000, "i" : 535 } }
OggiLogging:PRIMARY>
The first and third entries are the same IP and the second entry is for an old host. Do I have some stale data in the local db?
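To illustrate how rows like these could produce the huge lag figure: the plugin code takes the maximum lag over every row in local.slaves, so one old row dominates the result. The sketch below reruns that arithmetic on a hand copy of the four rows, then repeats it keeping only the newest row per host and only hosts that are still members. The names max_lag, newest_per_member and current_members are hypothetical, and the primary's last op time is assumed to equal the newest syncedTo value, since the thread does not give it.

```python
# Hand copy of the db.slaves.find() rows above; "t" is in milliseconds as
# printed by the shell, so it is converted to seconds before subtracting.
slaves = [
    {"host": "10.2.31.239", "t": 1311073175000},
    {"host": "10.92.243.66", "t": 1311073175000},
    {"host": "10.2.31.239", "t": 1311724093000},
    {"host": "10.122.161.138", "t": 1311724093000},
]

# Assumption: the primary's newest oplog entry matches the newest syncedTo.
last_master_op_time = 1311724093

# Assumption: the two secondaries that rs.status() currently lists.
current_members = {"10.2.31.239", "10.122.161.138"}


def max_lag(rows, master_time):
    """Largest lag in seconds over the given rows (0 if there are none)."""
    return max((master_time - row["t"] // 1000 for row in rows), default=0)


def newest_per_member(rows, members):
    """Keep one row per host: the newest, and only for hosts in members."""
    newest = {}
    for row in rows:
        host = row["host"]
        if host not in members:
            continue  # drop rows for hosts no longer in the set
        if host not in newest or row["t"] > newest[host]["t"]:
            newest[host] = row
    return list(newest.values())


naive = max_lag(slaves, last_master_op_time)
filtered = max_lag(newest_per_member(slaves, current_members), last_master_op_time)
print("all rows: lag=%d" % naive)
print("current members, newest row only: lag=%d" % filtered)
```

Under those assumptions the unfiltered maximum is 650918 seconds, driven entirely by the two rows synced to 1311073175, while the filtered maximum is 0.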
________________________________________
From: mongod...@googlegroups.com [mongod...@googlegroups.com] On Behalf Of Kristina Chodorow [k.cho...@gmail.com]
Sent: Tuesday, July 26, 2011 4:29 PM
It's actually:
--
________________________________________
From: mongod...@googlegroups.com [mongod...@googlegroups.com] On Behalf Of Kristina Chodorow [k.cho...@gmail.com]
Sent: Tuesday, July 26, 2011 5:11 PM
--
________________________________________
From: mongod...@googlegroups.com [mongod...@googlegroups.com] On Behalf Of Kristina Chodorow [k.cho...@gmail.com]
Sent: Tuesday, July 26, 2011 5:20 PM
--
________________________________________
From: mongod...@googlegroups.com [mongod...@googlegroups.com] On Behalf Of Kristina Chodorow [k.cho...@gmail.com]
Sent: Tuesday, August 02, 2011 12:43 PM
To: mongodb-user
Subject: [mongodb-user] Re: replication stuck?
To hack around:
// remember this result
--
________________________________________
From: mongod...@googlegroups.com [mongod...@googlegroups.com] On Behalf Of Kristina Chodorow [k.cho...@gmail.com]
Sent: Tuesday, August 02, 2011 3:10 PM