Initial sync fails

75 views
Skip to first unread message

Travis Beauvais

unread,
Oct 30, 2012, 6:51:38 PM10/30/12
to mongod...@googlegroups.com
I have a 4 shard cluster. Each shard has about 120 million document. I am trying to add a secondaries to each shard. Currently these shards only have a primary and an arbiters. I added the secondary and it started replicating. It gets to about 90 million and then stops. Here are the last few lines from the secondary's log file.

Tue Oct 30 16:54:19 [rsSync] clone database.collection 90171263
Tue Oct 30 16:54:28 [journal] old journal file will be removed: /data/mongo/journal/j._8
Tue Oct 30 16:54:31 [rsSync] 91567890 objects cloned so far from collection database.collection
Tue Oct 30 16:54:44 [TTLMonitor] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:database.system.indexes query:{ expireAfterSeconds: { $exists: true } }
Tue Oct 30 16:54:44 [TTLMonitor] problem detected during query over database.system.indexes : { $err: "not master or secondary; cannot currently read from this replSet member", code: 13436 }
Tue Oct 30 16:54:44 [TTLMonitor] ERROR: error processing ttl for db: database 10065 invalid parameter: expected an object ()
Tue Oct 30 16:55:16 [FileAllocator] allocating new datafile /data/mongo/database.17, filling with zeroes...
Tue Oct 30 16:55:16 [FileAllocator] done allocating datafile /data/mongo/database.17, size: 2047MB,  took 0.001 secs
Tue Oct 30 16:55:19 [rsSync] clone database.collection 96281215
Tue Oct 30 16:57:54 [conn43] command admin.$cmd command: { writebacklisten: ObjectId('508eee7e1f1d34c3d03490b6') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms
Tue Oct 30 17:02:54 [conn43] command admin.$cmd command: { writebacklisten: ObjectId('508eee7e1f1d34c3d03490b6') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms


This is only happening on 2 of the 4 shards when I add the secondary. The writebacklisten line comes up in both the failing secondaries and it has the same ObjectId() in the line. I don't know if that means anything or not. Also note that there is no TTL set for this collection.


Here is the rs.status() from the primary after it begins failing. Prior to failing the stateStr is "STARTUP2" with a state of 5.

rs.status()
{
"set" : "cluster3",
"date" : ISODate("2012-10-30T22:03:05Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 706313,
"optime" : Timestamp(1351633318000, 1),
"optimeDate" : ISODate("2012-10-30T21:41:58Z"),
"self" : true
},
{
"_id" : 1,
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 1339,
"lastHeartbeat" : ISODate("2012-10-30T22:03:04Z"),
"pingMs" : 2
},
{
"_id" : 2,
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : Timestamp(0, 0),
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2012-10-30T21:56:01Z"),
"pingMs" : 0,
"errmsg" : "DBClientBase::findN: transport error: secondary.domain.com:27017 ns: admin.$cmd query: { replSetHeartbeat: \"cluster3\", v: 76980, pv: 1, checkEmpty: false, from: \"primary.domain.com:27017\", $auth: {} }"
}
],
"ok" : 1
}

Travis Beauvais

unread,
Oct 30, 2012, 8:30:22 PM10/30/12
to mongod...@googlegroups.com
One of the secondaries started replicating again. Going slow as hell though. It is doing at best a couple thousand per minute whereas when it first started it was doing a few million.

The other once is stuck in the STARTUP2/5 state. The only things in the log files for this secondary are about TTL monitor (again, there are no TTL indexes set).

David Hows

unread,
Nov 19, 2012, 1:04:29 AM11/19/12
to mongod...@googlegroups.com
Hi Travis,

How is the load on your primary?

Those errors in the log look to relate to queries being directed to the new secondary.

Has replication continued? Have any of your nodes eventually reached stable?

Cheers,

David
Reply all
Reply to author
Forward
0 new messages