the shards primary node crash, not electing self on the secondary

78 views
Skip to first unread message

Erez Zarum

unread,
May 16, 2012, 8:51:50 AM5/16/12
to mongod...@googlegroups.com
MongoDB version 2.0.3, running on CentOS 6.2
I have a cluster which is made out of 3 shards, each shard has 3 nodes in replicaset (1 primary, 2 secondary), 3 config servers, about 20 mongos.
v2.0.x have been stable for quite some time, until today.
I have taken down for maintenance one of the nodes (secondary) from one of the shards, after a few minutes, the primary node on that shard started to receive huge amounts of connections, the log of that primary node started to fill up with "Wed May 16 10:36:45 [conn422884] waiting till out of critical section" (while the connXXXXXX is increasing) and then a lot of "SyncClusterConnection" messages, then the mongod process crashed, I was expecting the third node to take over, but it said "replSet can't see a majority, will not try to elect self", How can i cope with next time when i will have 2 nodes down? I understand that i need an arbiter only for a situation when i have an even number of nodes not when i have an odd number of nodes.

Those are the members when it happened:
shard0000
xx-xxx1 - secondary node
xx-xxx4 - secondary node
xx-xxx7 - primary node
shard0001
xx-xxx2 - secondary node
xx-xxx5 - secondary node
xx-xxx8 - primary node
shard0002
xx-xxx3 - secondary node
xx-xxx6 - secondary node
xx-xxx9 - primary node

config servers are also running on xx-xxx2,xx-xxx4,xx-xxx6 as a different process on port 27019
I took down xx-xxx5 for maintenance, then xx-xxx8 crashed and xx-xxx2 couldn't elect itself as primary, these leaving me with no primary at all.
The IP xx.xx.xx.223 is xx-xxx8.

This is the trigger that caused the whole crash on xx-xxx8
Wed May 16 10:30:22 [conn422864] setShardVersion - relocking slow: 13289
Wed May 16 10:30:22 [conn422867] command admin.$cmd command: { setShardVersion: "storage.object", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 32157000|0, serverID: ObjectId('4f520660fb8d26ba4a9b1ed1'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 13225ms
Wed May 16 10:30:22 [conn422864] command admin.$cmd command: { setShardVersion: "storage.object", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 32157000|0, serverID: ObjectId('4f520660fb8d26ba4a9b1ed1'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 13290ms
Wed May 16 10:30:22 [conn422764] setShardVersion - relocking slow: 13217
Wed May 16 10:30:22 [conn422764] command admin.$cmd command: { setShardVersion: "storage.object", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 32157000|0, serverID: ObjectId('4f52065fee21ac3b06f2ea31'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 13217ms
Wed May 16 10:30:22 [conn422743] setShardVersion - relocking slow: 12954
Wed May 16 10:30:22 [conn422743] command admin.$cmd command: { setShardVersion: "storage.object", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 32157000|0, serverID: ObjectId('4f52065fee21ac3b06f2ea31'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 12954ms
Wed May 16 10:30:24 [ReplicaSetMonitorWatcher] trying reconnect to xx-xxx5.xxx.xx:27018
Wed May 16 10:30:24 [ReplicaSetMonitorWatcher] reconnect xx-xxx5.xxx.xx:27018 failed couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:30:34 [ReplicaSetMonitorWatcher] trying reconnect to xx-xxx5.xxx.xx:27018
Wed May 16 10:30:34 [ReplicaSetMonitorWatcher] reconnect xx-xxx5.xxx.xx:27018 failed couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:30:38 [conn423408] end connection xx.xx.xx.243:53089
Wed May 16 10:30:38 [initandlisten] connection accepted from xx.xx.xx.243:53112 #423411
Wed May 16 10:30:40 [rsHealthPoll] couldn't connect to xx-xxx5.xxx.xx:27018: couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:30:44 [ReplicaSetMonitorWatcher] trying reconnect to xx-xxx5.xxx.xx:27018
Wed May 16 10:30:44 [ReplicaSetMonitorWatcher] reconnect xx-xxx5.xxx.xx:27018 failed couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:30:45 [conn385254] received moveChunk request: { moveChunk: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", to: "shard0000/xx-xxx1.xxx.xx:27018,xx-xxx4.xxx.xx:27018,xx-xxx7.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, maxChunkSizeBytes: 67108864, shardId: "storage.owner-_id_1873292031", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019" }
Wed May 16 10:30:45 [conn385254] created new distributed lock for storage.owner on xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019 ( lock timeout : 900000, ping interval : 30000, process : 0 )
Wed May 16 10:30:45 [conn385254] distributed lock 'storage.owner/xx-xxx8.xxx.xx:27018:1330783413:1345521425' acquired, ts : 4fb381d525fafe9747dcd91b
Wed May 16 10:30:45 [conn385254] about to log metadata event: { _id: "xx-xxx8.xxx.xx-2012-05-16T10:30:45-8235", server: "xx-xxx8.xxx.xx", clientAddr: "xx.xx.xx.233:45538", time: new Date(1337164245316), what: "moveChunk.start", ns: "storage.owner", details: { min: { _id: 1873292031 }, max: { _id: 1873548016 }, from: "shard0001", to: "shard0000" } }
Wed May 16 10:30:45 [conn385254] moveChunk request accepted at version 2036|5
Wed May 16 10:30:45 [conn385254] moveChunk number of documents: 249864
Wed May 16 10:30:46 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:46 [conn423394] command admin.$cmd command: { _migrateClone: 1 } ntoreturn:1 reslen:16776224 1175ms
Wed May 16 10:30:47 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 996, clonedBytes: 32680, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:48 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 3303, clonedBytes: 94741, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:49 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 5376, clonedBytes: 192074, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:50 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 7920, clonedBytes: 270464, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:51 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 10629, clonedBytes: 343185, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:52 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 13871, clonedBytes: 502426, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:53 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 16934, clonedBytes: 603977, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:54 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 19857, clonedBytes: 669925, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:54 [ReplicaSetMonitorWatcher] trying reconnect to xx-xxx5.xxx.xx:27018
Wed May 16 10:30:54 [ReplicaSetMonitorWatcher] reconnect xx-xxx5.xxx.xx:27018 failed couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:30:55 [conn260629] command admin.$cmd command: { writebacklisten: ObjectId('4f92171153561c8287f1c03a') } ntoreturn:1 reslen:44 300000ms
Wed May 16 10:30:55 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 22733, clonedBytes: 729590, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:56 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 26056, clonedBytes: 801272, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:57 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 29685, clonedBytes: 881081, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:58 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 33388, clonedBytes: 1007039, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:30:59 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "clone", counts: { cloned: 37203, clonedBytes: 1169065, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0 
A few minutes later:
counts: { cloned: 249864, clonedBytes: 21007934, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:35:55 [conn260629] command admin.$cmd command: { writebacklisten: ObjectId('4f92171153561c8287f1c03a') } ntoreturn:1 reslen:44 300000ms
Wed May 16 10:35:55 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "catchup", counts: { cloned: 249864, clonedBytes: 21007934, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:35:55 [ReplicaSetMonitorWatcher] trying reconnect to xx-xxx5.xxx.xx:27018
Wed May 16 10:35:55 [ReplicaSetMonitorWatcher] reconnect xx-xxx5.xxx.xx:27018 failed couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:35:56 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "catchup", counts: { cloned: 249864, clonedBytes: 21007934, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:35:57 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "catchup", counts: { cloned: 249864, clonedBytes: 21007934, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:35:58 [conn385254] moveChunk data transfer progress: { active: true, ns: "storage.owner", from: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018", min: { _id: 1873292031 }, max: { _id: 1873548016 }, state: "steady", counts: { cloned: 249864, clonedBytes: 21007934, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed May 16 10:35:58 [conn385254] moveChunk setting version to: 2037|0
Wed May 16 10:35:58 [conn414] command admin.$cmd command: { writebacklisten: ObjectId('4f5206605db2f6a0b9638970') } ntoreturn:1 reslen:361 164378ms
Wed May 16 10:35:58 [conn628] waiting till out of critical section
Wed May 16 10:35:58 [conn422726] waiting till out of critical section
Wed May 16 10:35:58 [conn422726] waiting till out of critical section
Wed May 16 10:35:58 [conn422710] waiting till out of critical section
Wed May 16 10:35:58 [conn628] waiting till out of critical section
Wed May 16 10:35:58 [conn422802] waiting till out of critical section
Wed May 16 10:35:58 [conn422802] waiting till out of critical section
Wed May 16 10:35:58 [conn423297] waiting till out of critical section
Wed May 16 10:35:58 [conn422726] waiting till out of critical section
Wed May 16 10:35:58 [conn422700] waiting till out of critical section
Wed May 16 10:35:58 [conn422707] waiting till out of critical section
Wed May 16 10:35:58 [conn422793] waiting till out of critical section
Wed May 16 10:35:58 [conn422709] waiting till out of critical section
Wed May 16 10:35:58 [conn423214] waiting till out of critical section
Wed May 16 10:35:58 [conn422790] waiting till out of critical section
Wed May 16 10:35:58 [conn422726] waiting till out of critical section
Wed May 16 10:35:58 [conn422710] waiting till out of critical section
Wed May 16 10:35:58 [conn628] waiting till out of critical section
Wed May 16 10:35:58 [conn422793] waiting till out of critical section
Then this happened:
Wed May 16 10:37:00 [conn422710] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660f8e8cc0bcb4c9861'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 62590ms
Wed May 16 10:37:00 [conn423233] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660f5c00fa8c3d3d79a'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 61617ms
Wed May 16 10:37:00 [conn423263] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660fb858c3d243e2c0e'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 58562ms
Wed May 16 10:37:00 [conn422866] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660fb8d26ba4a9b1ed1'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 58941ms
Wed May 16 10:37:00 [conn423271] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660f5c00fa8c3d3d79a'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 59922ms
Wed May 16 10:37:00 [conn422714] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660f8e8cc0bcb4c9861'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 62297ms
Wed May 16 10:37:00 [conn423009] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f52066018deafc41f3009cb'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 3928ms
Wed May 16 10:37:00 [conn422843] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660fb8d26ba4a9b1ed1'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 59145ms
Wed May 16 10:37:00 [conn422644] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f5206606a4e0e86d55c499e'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 57976ms
Wed May 16 10:37:00 [conn423073] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660d197523e0795f6a6'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 12872ms
Wed May 16 10:37:00 [conn422734] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660f8e8cc0bcb4c9861'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 60217ms
Wed May 16 10:37:00 [conn423218] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660fb858c3d243e2c0e'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 60143ms
Wed May 16 10:37:00 [conn422997] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660255a464978446fc1'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 23985ms
Wed May 16 10:37:00 [conn422861] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2036000|5, serverID: ObjectId('4f520660fb8d26ba4a9b1ed1'), shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:230 59007ms
Wed May 16 10:37:00 [conn385254] doing delete inline
Wed May 16 10:37:00 [conn423219] SyncClusterConnection connecting to [xx-xxx2.xxx.xx:27019]
Wed May 16 10:37:00 [conn423290] SyncClusterConnection connecting to [xx-xxx2.xxx.xx:27019]
Wed May 16 10:37:00 [conn423261] SyncClusterConnection connecting to [xx-xxx2.xxx.xx:27019]
Wed May 16 10:37:00 [conn423256] SyncClusterConnection connecting to [xx-xxx2.xxx.xx:27019]
Wed May 16 10:37:00 [conn423185] SyncClusterConnection connecting to [xx-xxx2.xxx.xx:27019]
Wed May 16 10:37:00 [conn423187] SyncClusterConnection connecting to [xx-xxx2.xxx.xx:27019]
Wed May 16 10:37:00 [conn423182] SyncClusterConnection connecting to [xx-xxx2.xxx.xx:27019]
Wed May 16 10:37:00 [conn423264] SyncClusterConnection connecting to [xx-xxx2.xxx.xx:27019]
I got a lot of SyncClusterConnection messages like those, it simply flooded the log file with those.
Wed May 16 10:37:02 [conn423306] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn423093] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn422873] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn423174] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn423069] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn422908] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn423170] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn423167] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn423190] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:02 [conn423168] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:04 [conn423300] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:04 [conn423169] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:10 [initandlisten] connection accepted from xx.xx.xx.253:43727 #423428
Wed May 16 10:37:10 [initandlisten] connection accepted from xx.xx.xx.45:43280 #423429
Wed May 16 10:37:11 [ReplicaSetMonitorWatcher] Socket recv() timeout  xx.xx.xx.223:27018
Wed May 16 10:37:11 [ReplicaSetMonitorWatcher] SocketException: remote: xx.xx.xx.223:27018 error: 9001 socket exception [3] server [xx.xx.xx.223:27018]
Wed May 16 10:37:11 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Wed May 16 10:37:11 [initandlisten] connection accepted from xx.xx.xx.222:57672 #423430
Wed May 16 10:37:11 [conn782] command admin.$cmd command: { writebacklisten: ObjectId('4f52065fee21ac3b06f2ea31') } ntoreturn:1 reslen:44 300126ms
Wed May 16 10:37:11 [conn273786] end connection xx.xx.xx.222:41724
Wed May 16 10:37:12 [ReplicaSetMonitorWatcher] trying reconnect to xx-xxx8.xxx.xx:27018
Wed May 16 10:37:13 [conn422869] SyncClusterConnection connecting to [xx-xxx4.xxx.xx:27019]
Wed May 16 10:37:09 [conn423284] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:15 [conn56] end connection xx.xx.xx.45:51201
Wed May 16 10:37:15 [conn76] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 8519ms
Wed May 16 10:37:15 [conn76] end connection xx.xx.xx.230:42721
Wed May 16 10:37:15 [conn80] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 7981ms
Wed May 16 10:37:15 [conn80] end connection xx.xx.xx.236:58130
Wed May 16 10:37:15 [conn71] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 7527ms
Wed May 16 10:37:15 [conn71] end connection xx.xx.xx.231:60271
Wed May 16 10:37:18 [conn68] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 8687ms
Wed May 16 10:37:18 [conn68] end connection xx.xx.xx.243:43031
Wed May 16 10:37:18 [conn77] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 7564ms
Wed May 16 10:37:18 [conn77] end connection xx.xx.xx.235:49301
Wed May 16 10:37:18 [conn79] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 7162ms
Wed May 16 10:37:18 [conn79] end connection xx.xx.xx.237:39346
Wed May 16 10:37:18 [conn74] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 8688ms
Wed May 16 10:37:18 [conn74] end connection xx.xx.xx.232:56111
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.237:45828 #423431
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.239:35750 #423432
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.229:38421 #423433
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.230:57010 #423434
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.236:39561 #423435
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.231:59722 #423436
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.243:44792 #423437
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.232:49545 #423438
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.235:54442 #423439
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.45:43283 #423440
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.242:54990 #423441
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.234:55494 #423442
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.233:34484 #423443
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.51:56798 #423444
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.237:45909 #423445
Wed May 16 10:37:18 [initandlisten] connection accepted from xx.xx.xx.229:38425 #423446
Wed May 16 10:37:18 [conn389567] end connection xx.xx.xx.253:40095
Wed May 16 10:37:24 [conn82] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 17621ms
Wed May 16 10:37:28 [conn82] end connection xx.xx.xx.239:44799
Wed May 16 10:37:25 [conn70] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 14518ms
Wed May 16 10:37:28 [conn70] end connection xx.xx.xx.234:40259
Wed May 16 10:37:25 [conn81] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 14371ms
Wed May 16 10:37:28 [conn81] end connection xx.xx.xx.51:55873
Wed May 16 10:37:25 [conn78] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 14339ms
Wed May 16 10:37:28 [conn78] end connection xx.xx.xx.233:48472
Wed May 16 10:37:25 [conn417128] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 14022ms
Wed May 16 10:37:28 [conn417128] end connection xx.xx.xx.229:42320
Wed May 16 10:37:25 [conn311421] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 13300ms
Wed May 16 10:37:28 [conn311421] end connection xx.xx.xx.224:54010
Wed May 16 10:37:25 [conn65] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 12673ms
Wed May 16 10:37:28 [conn65] end connection xx.xx.xx.227:41715
Wed May 16 10:37:25 [conn423426] command admin.$cmd command: { replSetHeartbeat: "shard0001", v: 7, pv: 1, checkEmpty: false, from: "xx-xxx2.xxx.xx:27018" } ntoreturn:1 reslen:125 11561ms
Wed May 16 10:37:28 [conn423426] end connection xx.xx.xx.243:44683
Wed May 16 10:37:25 [conn63] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 10942ms
Wed May 16 10:37:28 [conn63] end connection xx.xx.xx.228:53864
Wed May 16 10:37:25 [conn69] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 10796ms
Wed May 16 10:37:28 [conn69] end connection xx.xx.xx.251:50241
Wed May 16 10:37:25 [conn260619] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 reslen:265 14518ms
Wed May 16 10:37:26 [conn72] end connection xx.xx.xx.250:35094
Wed May 16 10:37:28 [conn260619] end connection xx.xx.xx.242:60661
Wed May 16 10:37:36 [conn271082] command admin.$cmd command: { writebacklisten: ObjectId('4f9216a651d5e36fce63adc3') } ntoreturn:1 reslen:44 300004ms
Wed May 16 10:37:49 [conn390969] command admin.$cmd command: { writebacklisten: ObjectId('4facdfa170f221030308efc7') } ntoreturn:1 reslen:44 300128ms
Wed May 16 10:37:51 [conn423295] query storage.object ntoreturn:1 idhack:1 reslen:38 50152ms
Wed May 16 10:37:51 [conn423255] query storage.owner ntoreturn:1 idhack:1 reslen:101 50152ms
Wed May 16 10:37:51 [conn423237] SyncClusterConnection connecting to [xx-xxx6.xxx.xx:27019]
Wed May 16 10:37:51 [conn423220] query storage.object_interaction ntoreturn:1 idhack:1 reslen:276 50167ms
That's where it crashed:
Wed May 16 10:37:51 [conn422710] setShardVersion - relocking slow: 50896
Wed May 16 10:37:51 [conn422702] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660f8e8cc0bcb4c9861'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50897ms
Wed May 16 10:37:51 [conn422849] setShardVersion - relocking slow: 50956
Wed May 16 10:37:51 [conn422710] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660f8e8cc0bcb4c9861'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50896ms
Wed May 16 10:37:51 [conn404944] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f5206605db2f6a0b9638970'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50972ms
Wed May 16 10:37:51 [conn422868] setShardVersion - relocking slow: 50955
Wed May 16 10:37:51 [conn422849] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660fb8d26ba4a9b1ed1'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50957ms
Wed May 16 10:37:51 [conn422949] setShardVersion - relocking slow: 50931
Wed May 16 10:37:51 [conn422877] setShardVersion - relocking slow: 50948
Wed May 16 10:37:51 [conn423192] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422868] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660fb8d26ba4a9b1ed1'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50957ms
Wed May 16 10:37:51 [conn422877] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f5206600b62806eed31df9e'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50950ms
Wed May 16 10:37:51 [conn422897] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn395117] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f5206605db2f6a0b9638970'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50972ms
Wed May 16 10:37:51 [conn423019] setShardVersion - relocking slow: 50953
Wed May 16 10:37:51 [conn422949] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660e5175a65bdc9db52'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50932ms
Wed May 16 10:37:51 [conn423043] setShardVersion - relocking slow: 50951
Wed May 16 10:37:51 dbexit:
Wed May 16 10:37:51 [conn422921] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422888] setShardVersion - relocking slow: 50950
Wed May 16 10:37:51 [conn423019] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f52066018deafc41f3009cb'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50953ms
Wed May 16 10:37:51 dbexit: ; exiting immediately
Wed May 16 10:37:51 [conn423043] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f52066018deafc41f3009cb'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50953ms
Wed May 16 10:37:51 [conn404854] setShardVersion - relocking slow: 50903
Wed May 16 10:37:51 [conn422897] shutdown: going to close listening sockets...
Wed May 16 10:37:51 [conn422896] setShardVersion - relocking slow: 50950
Wed May 16 10:37:51 [conn422897] closing listening socket: 5
Wed May 16 10:37:51 [conn422739] setShardVersion - relocking slow: 50956
Wed May 16 10:37:51 [conn422897] closing listening socket: 6
Wed May 16 10:37:51 [conn404854] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f5206606a4e0e86d55c499e'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50915ms
Wed May 16 10:37:51 [conn422888] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f5206600b62806eed31df9e'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50951ms
Wed May 16 10:37:51 [conn422952] setShardVersion - relocking slow: 50933
Wed May 16 10:37:51 [conn422896] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f5206600b62806eed31df9e'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50951ms
Wed May 16 10:37:51 [conn422882] setShardVersion - relocking slow: 50950
Wed May 16 10:37:51 [conn422897] closing listening socket: 7
Wed May 16 10:37:51 [conn423037] setShardVersion - relocking slow: 50951
Wed May 16 10:37:51 [conn422739] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f52065fee21ac3b06f2ea31'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50959ms
Wed May 16 10:37:51 [conn422897] removing socket file: /tmp/mongodb-27018.sock
Wed May 16 10:37:51 [conn422952] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660e5175a65bdc9db52'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50934ms
Wed May 16 10:37:51 [conn422882] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f5206600b62806eed31df9e'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50951ms
Wed May 16 10:37:51 [conn422897] shutdown: going to flush diaglog...
Wed May 16 10:37:51 [conn422897] shutdown: going to close sockets...
Wed May 16 10:37:51 [conn422897] shutdown: waiting for fs preallocator...
Wed May 16 10:37:51 [conn423037] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f52066018deafc41f3009cb'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50955ms
Wed May 16 10:37:51 [conn422985] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422897] shutdown: closing all files...
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn423029] setShardVersion - relocking slow: 50958
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422773] setShardVersion - relocking slow: 50963
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422861] setShardVersion - relocking slow: 50960
Wed May 16 10:37:51 [conn422773] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f52065fee21ac3b06f2ea31'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50963ms
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn423063] setShardVersion - relocking slow: 50950
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn423029] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f52066018deafc41f3009cb'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50958ms
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422861] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660fb8d26ba4a9b1ed1'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50960ms
Wed May 16 10:37:51 [conn423073] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn423063] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660d197523e0795f6a6'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50950ms
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn423298] setShardVersion - relocking slow: 50982
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn423298] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f520660fb858c3d243e2c0e'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50986ms
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422918] setShardVersion - relocking slow: 50941
Wed May 16 10:37:51 [conn422703] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn423015] setShardVersion - relocking slow: 50960
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422956] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn423053] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn423015] command admin.$cmd command: { setShardVersion: "storage.owner", configdb: "xx-xxx2.xxx.xx:27019,xx-xxx4.xxx.xx:27019,xx-xxx6.xxx.xx:27019", version: Timestamp 2037000|1, serverID: ObjectId('4f52066018deafc41f3009cb'), authoritative: true, shard: "shard0001", shardHost: "shard0001/xx-xxx2.xxx.xx:27018,xx-xxx5.xxx.xx:27018,xx-xxx8.xxx.xx:27018" } ntoreturn:1 reslen:57 50960ms
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422651] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422929] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422876] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422672] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Then later
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422674] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422803] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422899] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422715] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422774] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422920] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422708] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422931] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Logstream::get called in uninitialized stateWed May 16 10:37:51 [conn422644] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating

Logstream::get called in uninitialized state
Wed May 16 10:37:51 [conn422934] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422815] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn423149] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422746] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 [conn422762] ERROR: Uncaught std::exception: boost::thread_resource_error, terminating
Wed May 16 10:37:51 Backtrace:
0xa90d79 0xa91350 0x33b3c0f4a0 0x54e737 0x5591c5 0x559e94 0x8e08a5 0x8e14ba 0x95f69d 0x964415 0x88b86c 0x88eb5f 0xaa37d6 0x637497 0x33b3c077f1 0x33b38e592d
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa90d79]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa91350]
 /lib64/libpthread.so.0() [0x33b3c0f4a0]
 /usr/bin/mongod(_ZN5mongo10FieldRangeC1ERKNS_11BSONElementEbbb+0x197) [0x54e737]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet17processQueryFieldERKNS_11BSONElementEb+0x85) [0x5591c5]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSetC1EPKcRKNS_7BSONObjEbb+0x194) [0x559e94]
 /usr/bin/mongod(_ZN5mongo16MultiPlanScannerC1EPKcRKNS_7BSONObjES5_PKNS_11BSONElementEbS5_S5_bb+0x205) [0x8e08a5]
 /usr/bin/mongod(_ZN5mongo11MultiCursorC1EPKcRKNS_7BSONObjES5_N5boost10shared_ptrINS0_8CursorOpEEEb+0xda) [0x8e14ba]
 /usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x35d) [0x95f69d]
 /usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x125) [0x964415]
 /usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x47c) [0x88b86c]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x116f) [0x88eb5f]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa37d6]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x637497]
 /lib64/libpthread.so.0() [0x33b3c077f1]
 /lib64/libc.so.6(clone+0x6d) [0x33b38e592d]

At this point i had no option to log to the mongod process at all (port 27018), i tried to kill it safely, but i had to kill -9 the mongod process.

That's how the secondary node (xx-xxx2) log file looked like when it started to detect the primary node crashed:
Wed May 16 10:36:25 [rsHealthPoll] couldn't connect to xx-xxx5.xxx.xx:27018: couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:36:39 [conn402685] end connection xx.xx.xx.223:36431
Wed May 16 10:36:39 [initandlisten] connection accepted from xx.xx.xx.223:36451 #402686
Wed May 16 10:36:55 [rsHealthPoll] couldn't connect to xx-xxx5.xxx.xx:27018: couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:36:56 [conn187] command admin.$cmd command: { writebacklisten: ObjectId('4f5206606a4e0e86d55c499e') } ntoreturn:1 reslen:44 300040ms
Wed May 16 10:37:05 [conn257844] command admin.$cmd command: { writebacklisten: ObjectId('4f9216a651d5e36fce63adc3') } ntoreturn:1 reslen:44 300041ms
Wed May 16 10:37:12 [conn334] command admin.$cmd command: { writebacklisten: ObjectId('4f520660255a464978446fc1') } ntoreturn:1 reslen:44 300041ms
Wed May 16 10:37:18 [rsHealthPoll] DBClientCursor::init call() failed
Wed May 16 10:37:18 [rsHealthPoll] replSet info xx-xxx8.xxx.xx:27018 is down (or slow to respond): DBClientBase::findN: transport error: xx-xxx8.xxx.xx:27018 query: { replSetHeartbeat: "shard0001", v: 7, pv: 1, checkEmpty: false, from: "xx-xxx2.xxx.xx:27018" }
Wed May 16 10:37:18 [rsHealthPoll] replSet member xx-xxx8.xxx.xx:27018 is now in state DOWN
Wed May 16 10:37:18 [rsMgr] replSet can't see a majority, will not try to elect self
Wed May 16 10:37:20 [conn402686] end connection xx.xx.xx.223:36451
Wed May 16 10:37:25 [rsHealthPoll] couldn't connect to xx-xxx5.xxx.xx:27018: couldn't connect to server xx-xxx5.xxx.xx:27018
Wed May 16 10:37:29 [clientcursormon] mem (MB) res:51298 virt:1241040 mapped:1240643
Wed May 16 10:37:30 [rsHealthPoll] DBClientCursor::init call() failed
Wed May 16 10:37:31 [initandlisten] connection accepted from xx.xx.xx.220:34049 #402687
Wed May 16 10:37:31 [conn402687] end connection xx.xx.xx.220:34049
Wed May 16 10:37:37 [initandlisten] connection accepted from xx.xx.xx.220:34098 #402688
Wed May 16 10:37:37 [conn402688] end connection xx.xx.xx.220:34098
Wed May 16 10:37:42 [rsHealthPoll] DBClientCursor::init call() failed
Wed May 16 10:37:51 [rsHealthPoll] DBClientCursor::init call() failed
Wed May 16 10:37:52 [conn456] end connection xx.xx.xx.223:50073
Wed May 16 10:37:53 [initandlisten] connection accepted from xx.xx.xx.223:37586 #402689

Scott Hernandez

unread,
May 16, 2012, 9:00:27 AM5/16/12
to mongod...@googlegroups.com
Is the summary of the replica set states?

3 nodes:
1.) first down for (hardware) maintanance
2.) second node failed (resource issue)
3.) only one node left, cannot be primary.

If this is the correct timeline then the reason for 3 is because you
must have a majority to elect a primary -- 1 of 3 is not a majority.
> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb

Erez Zarum

unread,
May 16, 2012, 10:08:15 AM5/16/12
to mongod...@googlegroups.com
In this situation, i need to worry when 1 node goes down? 
The solution i see is:
1) add 2 arbiters (To create odd number of members)
2) add 1 node (another replica node, secondary) + 1 arbiter
In this situation, i can have 2 nodes going down
Correct?

Eliot Horowitz

unread,
May 17, 2012, 1:03:16 AM5/17/12
to mongod...@googlegroups.com
Yes, your analysis seems correct.

Generally, if you have 5 nodes, you can lose 2 as 3 is still a majority.
>> > mongodb-user...@googlegroups.com
>> > See also the IRC channel -- freenode.net#mongodb
>
> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com

Erez Zarum

unread,
May 17, 2012, 8:23:12 AM5/17/12
to mongod...@googlegroups.com
In this situation, adding 2 arbiters can help, no?
What about the root cause that the mongod process crashed? is it fixed in 2.0.5?
>> > See also the IRC channel -- freenode.net#mongodb
>
> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com
> To unsubscribe from this group, send email to

Barrie

unread,
Jul 6, 2012, 4:22:01 PM7/6/12
to mongod...@googlegroups.com
The problem with adding 2 arbiters is that, while it does provide a majority of votes should 2 of your nodes go down, by electing the remaining secondary as primary you'll find yourself with automatic failover but without replication, as your replica set consist of a primary and 2 arbiters. What would be better would be to add another arbiter, and another data node, so in the off-chance that 2 of your nodes going down again you'll elect a new primary and have at least one other copy of the data on a secondary as well. 

As for the crash you can try reducing the connection pool size to prevent this from happening again.
Reply all
Reply to author
Forward
0 new messages