I have a 4 shard cluster. Each shard has about 120 million document. I am trying to add a secondaries to each shard. Currently these shards only have a primary and an arbiters. I added the secondary and it started replicating. It gets to about 90 million and then stops. Here are the last few lines from the secondary's log file.
Tue Oct 30 16:54:19 [rsSync] clone database.collection 90171263
Tue Oct 30 16:54:28 [journal] old journal file will be removed: /data/mongo/journal/j._8
Tue Oct 30 16:54:31 [rsSync] 91567890 objects cloned so far from collection database.collection
Tue Oct 30 16:54:44 [TTLMonitor] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:database.system.indexes query:{ expireAfterSeconds: { $exists: true } }
Tue Oct 30 16:54:44 [TTLMonitor] problem detected during query over database.system.indexes : { $err: "not master or secondary; cannot currently read from this replSet member", code: 13436 }
Tue Oct 30 16:54:44 [TTLMonitor] ERROR: error processing ttl for db: database 10065 invalid parameter: expected an object ()
Tue Oct 30 16:55:16 [FileAllocator] allocating new datafile /data/mongo/database.17, filling with zeroes...
Tue Oct 30 16:55:16 [FileAllocator] done allocating datafile /data/mongo/database.17, size: 2047MB, took 0.001 secs
Tue Oct 30 16:55:19 [rsSync] clone database.collection 96281215
Tue Oct 30 16:57:54 [conn43] command admin.$cmd command: { writebacklisten: ObjectId('508eee7e1f1d34c3d03490b6') } ntoreturn:1 keyUpdates:0 reslen:44 300000ms
Tue Oct 30 17:02:54 [conn43] command admin.$cmd command: { writebacklisten: ObjectId('508eee7e1f1d34c3d03490b6') } ntoreturn:1 keyUpdates:0 reslen:44 300000ms
This is only happening on 2 of the 4 shards when I add the secondary. The writebacklisten line comes up in both the failing secondaries and it has the same ObjectId() in the line. I don't know if that means anything or not. Also note that there is no TTL set for this collection.
Here is the rs.status() from the primary after it begins failing. Prior to failing the stateStr is "STARTUP2" with a state of 5.
rs.status()
{
"set" : "cluster3",
"date" : ISODate("2012-10-30T22:03:05Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 706313,
"optime" : Timestamp(1351633318000, 1),
"optimeDate" : ISODate("2012-10-30T21:41:58Z"),
"self" : true
},
{
"_id" : 1,
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 1339,
"lastHeartbeat" : ISODate("2012-10-30T22:03:04Z"),
"pingMs" : 2
},
{
"_id" : 2,
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : Timestamp(0, 0),
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2012-10-30T21:56:01Z"),
"pingMs" : 0,
}
],
"ok" : 1
}