Problems on chunk split with 1.8.0 rc2


Marc

Mar 15, 2011, 5:55:41 AM
to mongodb-user
I've just upgraded from 1.6.5.
Everything seems to be going fine (and performing much better).
I had problems with shard chunk copying and splitting (very low
performance), and I read that 1.8.0 would handle all of this much
better, so I upgraded.

The problem now is that something seems to be wrong with splitting:

At some point, this message appears on one of my partitions:
Tue Mar 15 10:45:15 [conn568] request split points lookup for chunk crawler.ad { : 10662019385342 } -->> { : MaxKey }

And at the same time, the two mongos processes in use start logging
lots of messages like these:

....
/home/mongo/bin/mongos(_ZN5mongo7Request7processEi+0x5e6) [0x6680b6]
/home/mongo/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x141) [0x67ae11]
/home/mongo/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x272) [0x5817a2]
/home/mongo/bin/mongos(thread_proxy+0x80) [0x6a26f0]
/lib64/libpthread.so.0 [0x386c406367]
/lib64/libc.so.6(clone+0x6d) [0x386bcd309d]
Tue Mar 15 10:45:56 [conn14] AssertionException in process: setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
Tue Mar 15 10:45:57 [conn14] update failed b/c of StaleConfigException, retrying left:4 ns: crawler.ad query: { _id: 662005864631 }
Tue Mar 15 10:45:57 [conn14] ns: crawler.ad ClusteredCursor::query ShardConnection had to change attempt: 0
Tue Mar 15 10:45:58 [conn17] ns: crawler.ad ClusteredCursor::query ShardConnection had to change attempt: 0
Tue Mar 15 10:45:58 [conn14] setShardVersion failed: { errmsg: "not master", ok: 0.0 }
Tue Mar 15 10:45:58 [conn14] Assertion: 10429:setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
0x522c19 0x69ee23 0x69e90b 0x69e90b 0x69e90b 0x5799e6 0x57a2f6 0x57781d 0x6680b6 0x67ae11 0x5817a2 0x6a26f0 0x386c406367 0x386bcd309d
/home/mongo/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x522c19]
/home/mongo/bin/mongos [0x69ee23]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x5799e6]
/home/mongo/bin/mongos(_ZN5mongo17ClientConnections13checkVersionsERKSs+0x1c6) [0x57a2f6]
/home/mongo/bin/mongos(_ZN5mongo15ShardConnection25checkMyConnectionVersionsERKSs+0x3d) [0x57781d]
/home/mongo/bin/mongos(_ZN5mongo7Request7processEi+0x5e6) [0x6680b6]
/home/mongo/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x141) [0x67ae11]
/home/mongo/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x272) [0x5817a2]
/home/mongo/bin/mongos(thread_proxy+0x80) [0x6a26f0]
/lib64/libpthread.so.0 [0x386c406367]
/lib64/libc.so.6(clone+0x6d) [0x386bcd309d]
Tue Mar 15 10:45:58 [conn14] AssertionException in process: setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
Tue Mar 15 10:45:58 [conn17] setShardVersion failed: { errmsg: "not master", ok: 0.0 }
Tue Mar 15 10:45:58 [conn17] Assertion: 10429:setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
0x522c19 0x69ee23 0x69e90b 0x69e90b 0x69e90b 0x5799e6 0x57a2f6 0x57781d 0x6680b6 0x67ae11 0x5817a2 0x6a26f0 0x386c406367 0x386bcd309d
/home/mongo/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x522c19]
/home/mongo/bin/mongos [0x69ee23]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x5799e6]
/home/mongo/bin/mongos(_ZN5mongo17ClientConnections13checkVersionsERKSs+0x1c6) [0x57a2f6]
/home/mongo/bin/mongos(_ZN5mongo15ShardConnection25checkMyConnectionVersionsERKSs+0x3d) [0x57781d]
/home/mongo/bin/mongos(_ZN5mongo7Request7processEi+0x5e6) [0x6680b6]
/home/mongo/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x141) [0x67ae11]
/home/mongo/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x272) [0x5817a2]
/home/mongo/bin/mongos(thread_proxy+0x80) [0x6a26f0]
/lib64/libpthread.so.0 [0x386c406367]
/lib64/libc.so.6(clone+0x6d) [0x386bcd309d]
Tue Mar 15 10:45:58 [conn17] AssertionException in process: setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
Tue Mar 15 10:45:58 [conn14] ns: crawler.ad ClusteredCursor::query ShardConnection had to change attempt: 0
Tue Mar 15 10:45:58 [conn14] setShardVersion failed: { errmsg: "not master", ok: 0.0 }
Tue Mar 15 10:45:58 [conn14] Assertion: 10429:setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
0x522c19 0x69ee23 0x69e90b 0x69e90b 0x69e90b 0x5799e6 0x57a2f6 0x57781d 0x6680b6 0x67ae11 0x5817a2 0x6a26f0 0x386c406367 0x386bcd309d
/home/mongo/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x522c19]
/home/mongo/bin/mongos [0x69ee23]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x5799e6]
/home/mongo/bin/mongos(_ZN5mongo17ClientConnections13checkVersionsERKSs+0x1c6) [0x57a2f6]
/home/mongo/bin/mongos(_ZN5mongo15ShardConnection25checkMyConnectionVersionsERKSs+0x3d) [0x57781d]
/home/mongo/bin/mongos(_ZN5mongo7Request7processEi+0x5e6) [0x6680b6]
/home/mongo/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x141) [0x67ae11]
/home/mongo/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x272) [0x5817a2]
/home/mongo/bin/mongos(thread_proxy+0x80) [0x6a26f0]
/lib64/libpthread.so.0 [0x386c406367]
/lib64/libc.so.6(clone+0x6d) [0x386bcd309d]
Tue Mar 15 10:45:58 [conn14] AssertionException in process: setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
Tue Mar 15 10:45:59 [LockPinger] dist_lock pinged successfully for: OBE010.local:1300111198:1804289383
Tue Mar 15 10:45:59 [conn14] ns: crawler.ad ClusteredCursor::query ShardConnection had to change attempt: 0
Tue Mar 15 10:45:59 [conn14] setShardVersion failed: { errmsg: "not master", ok: 0.0 }
Tue Mar 15 10:45:59 [conn14] Assertion: 10429:setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
0x522c19 0x69ee23 0x69e90b 0x69e90b 0x69e90b 0x5799e6 0x57a2f6 0x57781d 0x6680b6 0x67ae11 0x5817a2 0x6a26f0 0x386c406367 0x386bcd309d
/home/mongo/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x522c19]
/home/mongo/bin/mongos [0x69ee23]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos [0x69e90b]
/home/mongo/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x5799e6]
/home/mongo/bin/mongos(_ZN5mongo17ClientConnections13checkVersionsERKSs+0x1c6) [0x57a2f6]
/home/mongo/bin/mongos(_ZN5mongo15ShardConnection25checkMyConnectionVersionsERKSs+0x3d) [0x57781d]
/home/mongo/bin/mongos(_ZN5mongo7Request7processEi+0x5e6) [0x6680b6]
/home/mongo/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x141) [0x67ae11]
/home/mongo/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x272) [0x5817a2]
/home/mongo/bin/mongos(thread_proxy+0x80) [0x6a26f0]
/lib64/libpthread.so.0 [0x386c406367]
/lib64/libc.so.6(clone+0x6d) [0x386bcd309d]
Tue Mar 15 10:45:59 [conn14] AssertionException in process: setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
Tue Mar 15 10:45:59 [conn13] AssertionException in process: ns: crawler.ad ClusteredCursor::query ShardConnection had to change
Tue Mar 15 10:46:01 [conn14] AssertionException in process: ns: crawler.ad ClusteredCursor::query ShardConnection had to change


At this point the messages stop, until after some time this appears again:

request split points lookup for chunk crawler.ad { : 10662019385342 } -->> { : MaxKey }

and everything repeats.

It's doing this every 5 minutes or so...

* Is the splitting working?
* If not, is there any way to solve this?
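
One rough way to see whether splits are actually being recorded (a sketch only, assuming a mongo shell connected to one of the mongos processes; the namespace is the one from the log above) is to look at the config database:

    // Run from a mongo shell connected to a mongos.
    var conf = db.getSiblingDB("config");

    // Number of chunks for the collection -- if autosplitting works,
    // this count should keep growing as data is inserted.
    print("chunks: " + conf.chunks.count({ ns: "crawler.ad" }));

    // Most recent split / moveChunk events the cluster has logged.
    conf.changelog.find({ ns: "crawler.ad" }).sort({ time: -1 }).limit(5).forEach(printjson);

    // Overall view of shards, chunk ranges and versions.
    db.printShardingStatus();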

Jared Rosoff

Mar 15, 2011, 7:47:23 PM
to mongodb-user
Hi Marc,

I'd like to look at more of your logs...

Can you go to jira.mongodb.org, register if you haven't already, and
create a Community Private issue? Then attach your logs to that.

I'll be sure to post back to the group so everyone can see what's
going on.

Thanks.

Marc

Mar 16, 2011, 5:50:29 AM
to mongodb-user
Here it goes

http://jira.mongodb.org/browse/SUPPORT-101

I have to wait for the exception to appear again on a different
mongos (it already had the split problem, but only starts throwing
exceptions after a while). The exceptions are the same as the ones
posted above.
I restarted the mongos because the application was having problems
("not master" errors).
I've attached the log as it is now, with the first split errors.

Marc Gracia

Mar 21, 2011, 7:44:14 AM
to mongod...@googlegroups.com
I must say this is still happening.
Every time an autosplit or chunk migration is initiated, the mongos starts throwing exceptions and stops working.

I really need to solve this. Does anyone have any idea why this is happening? Any workaround?
Stopping the balancer did not help.
Maybe it's a config server problem? I can't see any error there, but the StaleConfigException message makes me think it might be related.

Could something be out of sync, or could there be a stale lock somewhere?
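
A quick way to check for exactly that (a sketch, assuming a mongo shell connected to a mongos; the balancer document and lock collection are the standard ones in the config database) is to inspect config.settings and config.locks:

    // Run from a mongo shell connected to a mongos.
    var conf = db.getSiblingDB("config");

    // Is the balancer explicitly stopped?
    printjson(conf.settings.findOne({ _id: "balancer" }));

    // Distributed locks: state > 0 means taken. A lock whose "process"
    // field names a mongos/mongod that is long gone would be a stale lock.
    conf.locks.find({ state: { $gt: 0 } }).forEach(printjson);

    // Stopping the balancer (what "stopping the balancer" above refers to):
    conf.settings.update({ _id: "balancer" }, { $set: { stopped: true } }, true);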




Eliot Horowitz

Mar 21, 2011, 7:46:08 AM
to mongod...@googlegroups.com
Can you send the latest error messages?
Also, can you upgrade to 1.8.0 if you haven't already, and make sure
all mongod and mongos processes are on it?
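
One way to confirm what each process is actually running (a sketch only; the host names and ports below are placeholders, not taken from this cluster) is to ask each one for its build info from a mongo shell:

    // Ask each mongod / mongos / config server for its version.
    // Host names and ports are placeholders -- substitute your own.
    var hosts = ["shardhost1:27018", "shardhost2:27018", "confighost:27019", "mongoshost:27017"];
    hosts.forEach(function (h) {
        var conn = new Mongo(h);
        print(h + " -> " + conn.getDB("admin").runCommand({ buildinfo: 1 }).version);
    });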


Marc Gracia

Mar 21, 2011, 8:36:27 AM
to mongod...@googlegroups.com
Related mongos log from one of the latest failures:

Fri Mar 18 18:05:05 [LockPinger] dist_lock pinged successfully for: OBE017.local:1300448993:1804289383
Fri Mar 18 18:06:06 [mongosMain] connection accepted from 62.212.84.220:60590 #34
Fri Mar 18 18:10:05 [LockPinger] dist_lock pinged successfully for: OBE017.local:1300448993:1804289383
Fri Mar 18 18:15:05 [LockPinger] dist_lock pinged successfully for: OBE017.local:1300448993:1804289383
Fri Mar 18 18:18:09 [conn24] autosplitted crawler.ad shard: ns:crawler.ad at: partitionB:partitionB/flash,gordon:27018 lastmod: 138|0 min: { _id: 662009843887 } max: { _id: 662009883156 } on: { _id: 10000068480361 }(splitThreshold 209715200)
Fri Mar 18 18:18:09 [conn24] update failed b/c of StaleConfigException, retrying  left:4 ns: crawler.ad query: { _id: 662009274178 }
Fri Mar 18 18:18:11 [conn28] ns: crawler.ad ClusteredCursor::query ShardConnection had to change attempt: 0
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [flash:27019]
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [gordon:27019]
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [mara:27019]
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [flash:27019]
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [gordon:27019]
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [mara:27019]
Fri Mar 18 18:18:11 [conn28]      setShardVersion failed: { errmsg: "not master", ok: 0.0 }
Fri Mar 18 18:18:11 [conn28] Assertion: 10429:setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
0x522c19 0x69ee23 0x69e90b 0x69e90b 0x69e90b 0x5799e6 0x57a2f6 0x57781d 0x6680b6 0x67ae11 0x5817a2 0x6a26f0 0x30e920673d 0x30e8ad3f6d
 /opt/mongo/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x522c19]
 /opt/mongo/bin/mongos [0x69ee23]
 /opt/mongo/bin/mongos [0x69e90b]
 /opt/mongo/bin/mongos [0x69e90b]
 /opt/mongo/bin/mongos [0x69e90b]
 /opt/mongo/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x5799e6]
 /opt/mongo/bin/mongos(_ZN5mongo17ClientConnections13checkVersionsERKSs+0x1c6) [0x57a2f6]
 /opt/mongo/bin/mongos(_ZN5mongo15ShardConnection25checkMyConnectionVersionsERKSs+0x3d) [0x57781d]
 /opt/mongo/bin/mongos(_ZN5mongo7Request7processEi+0x5e6) [0x6680b6]
 /opt/mongo/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x141) [0x67ae11]
 /opt/mongo/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x272) [0x5817a2]
 /opt/mongo/bin/mongos(thread_proxy+0x80) [0x6a26f0]
 /lib64/libpthread.so.0 [0x30e920673d]
 /lib64/libc.so.6(clone+0x6d) [0x30e8ad3f6d]
Fri Mar 18 18:18:11 [conn28] AssertionException in process: setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
Fri Mar 18 18:18:11 [conn28] ns: crawler.ad ClusteredCursor::query ShardConnection had to change attempt: 0
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [flash:27019]
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [gordon:27019]
Fri Mar 18 18:18:11 [conn28] SyncClusterConnection connecting to [mara:27019]
Fri Mar 18 18:18:11 [conn28]      setShardVersion failed: { errmsg: "not master", ok: 0.0 }
Fri Mar 18 18:18:11 [conn28] Assertion: 10429:setShardVersion failed! { "errmsg" : "not master", "ok" : 0 }
0x522c19 0x69ee23 0x69e90b 0x69e90b 0x69e90b 0x5799e6 0x57a2f6 0x57781d 0x6680b6 0x67ae11 0x5817a2 0x6a26f0 0x30e920673d 0x30e8ad3f6d
 /opt/mongo/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x522c19]
 /opt/mongo/bin/mongos [0x69ee23]
 /opt/mongo/bin/mongos [0x69e90b]
 /opt/mongo/bin/mongos [0x69e90b]
 /opt/mongo/bin/mongos [0x69e90b]
 /opt/mongo/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x5799e6]
 /opt/mongo/bin/mongos(_ZN5mongo17ClientConnections13checkVersionsERKSs+0x1c6) [0x57a2f6]
 /opt/mongo/bin/mongos(_ZN5mongo15ShardConnection25checkMyConnectionVersionsERKSs+0x3d) [0x57781d]
 /opt/mongo/bin/mongos(_ZN5mongo7Request7processEi+0x5e6) [0x6680b6]

 ... ad infinitum. Until restarted.


In the partitionB mongod log I found this around the same time:

...
Fri Mar 18 18:18:09 [conn1195] end connection 95.211.11.30:33282
Fri Mar 18 18:18:09 [conn1200] end connection 95.211.11.30:33292
Fri Mar 18 18:18:09 [conn1366] end connection 95.211.11.30:49865
Fri Mar 18 18:18:09 [initandlisten] connection accepted from 95.211.11.30:35538 #1498
Fri Mar 18 18:18:09 [conn1498] request split points lookup for chunk crawler.ad { : 662009843887 } -->> { : 10000068480361 }
Fri Mar 18 18:18:09 [initandlisten] connection accepted from 94.75.192.38:33921 #1499
Fri Mar 18 18:18:09 [conn1499] end connection 94.75.192.38:33921
Fri Mar 18 18:18:09 [conn1498] max number of requested split points reached (2) before the end of chunk crawler.ad { : 662009843887 } -->> { : 10000068480361 }
Fri Mar 18 18:18:09 [conn1498] warning: Finding the split vector for crawler.ad over { _id: 1.0 } keyCount: 39243 numSplits: 2 lookedAt: 0 took 117ms
Fri Mar 18 18:18:09 [conn1498] query admin.$cmd ntoreturn:1 command: { splitVector: "crawler.ad", keyPattern: { _id: 1.0 }, min: { _id: 662009843887 }, max: { _id: 10000068480361 }, maxChunkSizeBytes: 209715200, maxSplitPoints: 2, maxChunkObjects: 250000 } reslen:111 117ms
Fri Mar 18 18:18:09 [conn1498] SyncClusterConnection connecting to [flash:27019]
Fri Mar 18 18:18:09 [conn1498] SyncClusterConnection connecting to [gordon:27019]
Fri Mar 18 18:18:09 [conn1498] SyncClusterConnection connecting to [mara:27019]
Fri Mar 18 18:18:09 [conn1498] updated set (partitionB) to: partitionB/flash,gordon:27018
Fri Mar 18 18:18:09 [initandlisten] connection accepted from 95.211.11.30:35543 #1500
Fri Mar 18 18:18:09 [initandlisten] connection accepted from 95.211.11.30:35544 #1501
Fri Mar 18 18:18:09 [conn1498] received splitChunk request: { splitChunk: "crawler.ad", keyPattern: { _id: 1.0 }, min: { _id: 662009843887 }, max: { _id: 10000068480361 }, from: "partitionB/flash,gordon:27018", splitKeys: [ { _id: 662009883156 } ], shardId: "crawler.ad-_id_662009843887", configdb: "flash:27019,gordon:27019,mara:27019" }
Fri Mar 18 18:18:09 [LockPinger] creating dist lock ping thread for: flash:27019,gordon:27019,mara:27019
Fri Mar 18 18:18:09 [LockPinger] SyncClusterConnection connecting to [flash:27019]
Fri Mar 18 18:18:09 [LockPinger] SyncClusterConnection connecting to [gordon:27019]
Fri Mar 18 18:18:09 [LockPinger] SyncClusterConnection connecting to [mara:27019]
Fri Mar 18 18:18:09 [conn1498] splitChunk accepted at version 138|0
Fri Mar 18 18:18:09 [conn1498] about to log metadata event: { _id: "OBE017.local-2011-03-18T17:18:09-1", server: "OBE017.local", clientAddr: "95.211.11.30:35538", time: new Date(1300468689635), what: "split", ns: "crawler.ad", details: { before: { min: { _id: 662009843887 }, max: { _id: 10000068480361 }, lastmod: Timestamp 138000|0 }, left: { min: { _id: 662009843887 }, max: { _id: 662009883156 }, lastmod: Timestamp 138000|2 }, right: { min: { _id: 662009883156 }, max: { _id: 10000068480361 }, lastmod: Timestamp 138000|3 } } }
Fri Mar 18 18:18:09 [conn1498] query admin.$cmd ntoreturn:1 command: { splitChunk: "crawler.ad", keyPattern: { _id: 1.0 }, min: { _id: 662009843887 }, max: { _id: 10000068480361 }, from: "partitionB/flash,gordon:27018", splitKeys: [ { _id: 662009883156 } ], shardId: "crawler.ad-_id_662009843887", configdb: "flash:27019,gordon:27019,mara:27019" } reslen:53 487ms
Fri Mar 18 18:19:00 [conn1191] insert crawler.ad 390ms
Fri Mar 18 18:19:24 [conn159] query admin.$cmd ntoreturn:1 command: { writebacklisten: ObjectId('4d249c0f2daa327869f816c7') } reslen:60 300048ms
Fri Mar 18 18:19:24 [conn163] query admin.$cmd ntoreturn:1 command: { writebacklisten: ObjectId('4d249b1d816216524426d17b') } reslen:60 300048ms
Fri Mar 18 18:19:56 [conn1183] query admin.$cmd ntoreturn:1 command: { writebacklisten: ObjectId('4d8346e06ab0f7d672b0b2f2') } reslen:60 300048ms
Fri Mar 18 18:20:25 [initandlisten] connection accepted from 94.75.192.38:33966 #1502
Fri Mar 18 18:20:25 [conn1502] end connection 94.75.192.38:33966
Fri Mar 18 18:20:34 [initandlisten] connection accepted from 94.75.192.38:33980 #1503
Fri Mar 18 18:20:34 [conn1503] end connection 94.75.192.38:33980
Fri Mar 18 18:21:54 [conn1187] insert crawler.ad 278ms
Fri Mar 18 18:22:30 [conn5] query admin.$cmd ntoreturn:1 command: { writebacklisten: ObjectId('4d7f606edff5ce55c0758754') } reslen:60 300047ms
Fri Mar 18 18:22:39 [LockPinger] dist_lock pinged successfully for: OBE017.local:1300468689:1182095420
Fri Mar 18 18:22:52 [conn1190] insert crawler.ad 154ms
...
Then, for some time (I think until the mongos restart, but I'm not sure), it keeps logging this, mixed in with normal entries:

Fri Mar 18 18:46:45 [conn1529] request split points lookup for chunk crawler.ad { : 10662019454956 } -->> { : MaxKey }
Fri Mar 18 18:46:47 [conn1529] request split points lookup for chunk crawler.ad { : 10662019454956 } -->> { : MaxKey }
Fri Mar 18 18:46:47 [conn1529] request split points lookup for chunk crawler.ad { : 10662019454956 } -->> { : MaxKey }
Fri Mar 18 18:46:49 [conn1529] request split points lookup for chunk crawler.ad { : 10662019454956 } -->> { : MaxKey }
...

But it looks like no error appears anywhere in the mongod logs (and none in the config server logs either).

I'll try 1.8.0 final instead of RC2 and see.

I'll post back if anything changes...


Alvin Richards

Mar 21, 2011, 8:52:56 AM
to mongodb-user
Opened

http://jira.mongodb.org/browse/SERVER-2805

to track this issue. Let us know what happens with 1.8.0

-Alvin

briank

Apr 26, 2011, 1:54:53 PM
to mongodb-user
I am still seeing this issue running 1.8.1 on all components. The
associated Jira issue doesn't have a Fix version currently - any plans
to get this in soon? Or are there any workarounds? It is causing me a
lot of pain during a bulk load.

--Brian


Gaetan Voyer-Perrault

Apr 26, 2011, 2:34:34 PM
to mongod...@googlegroups.com
I'm not sure that issue was ever clearly identified and documented.

Would you be able to provide some log files showing exactly what you're seeing? Log files are located on the server running the mongod process; the log path is set by --logpath at startup.

For ease of reading, please post the log files on a paste site like gist.github.com, pastie.org, or pastebin.com.

> It is causing me a lot of pain during a bulk load.

Would you also be able to provide some clarification on this bulk loading?
 - How much data is being bulk loaded (relative to DB size)? How long is it taking?
 - Are you pre-splitting keys so that data is split evenly amongst the shards?
 - Are you turning off the balancer during the bulk loading?

- Gates
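
For reference, pre-splitting and pausing the balancer around a bulk load might look roughly like this from a mongo shell connected to a mongos (a sketch only; the namespace and split points below are made-up placeholders):

    // Sketch only -- namespace and split points are placeholders.
    var admin = db.getSiblingDB("admin");
    var conf  = db.getSiblingDB("config");

    // 1. Pause the balancer for the duration of the load.
    conf.settings.update({ _id: "balancer" }, { $set: { stopped: true } }, true);

    // 2. Pre-split on the shard key so inserts spread across chunks
    //    instead of repeatedly splitting one hot chunk during the load.
    [10000, 20000, 30000, 40000].forEach(function (t) {
        admin.runCommand({ split: "mydb.mycoll", middle: { t: t } });
    });

    // 3. Re-enable the balancer once the load has finished.
    conf.settings.update({ _id: "balancer" }, { $set: { stopped: false } }, true);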

Eliot Horowitz

Apr 26, 2011, 2:35:17 PM
to mongod...@googlegroups.com
Can you send the exact message you're seeing?

I think it's probably this issue: https://jira.mongodb.org/browse/SERVER-2961

briank

Apr 27, 2011, 10:27:51 AM
to mongod...@googlegroups.com
These are the couple of relevant lines from my mongos logs:
           20:52:56 [conn909] autosplitted coredata.clips shard: ns:coredata.clips at: Shard B:RS2/ip-10-78-105-252.ec2.internal:27018 lastmod: 85|3 min: { t: 11414 } max: { t: 11527 } on: { t: 11788 }(splitThreshold 67108864)
           20:52:56 [conn923] update failed b/c of StaleConfigException, retrying  left:4 ns: coredata.clips query: { _id: 36945, t: 17, cu.i: 902 }
           20:52:57 [conn923] AssertionException in process: ns: coredata.clips ClusteredCursor::query ShardConnection had to change

It then throws an equivalent exception in my C# code. 

I can dig through my mongod logs if that would help also.

To answer your other questions:
1. I'm bulk loading about 30M rows (~15GB) into a sharded cluster with two shards, each a replica set of 3 members. During this process I was also running an upgrade script to modify a small percentage of the documents' data to upgrade them to a new version. The updates during the bulk load seemed to cause the problem. I realize running updates during a bulk load is uncommon, but I'm more worried about updates hitting our cluster during a normal high-usage scenario and triggering the same exception.
2. We did not pre-split, but when I started running the update script our data was about 95% loaded, so splitting should have been at a minimum at that point. (But I do realize this exception is directly related to splitting of chunks.)
3. We did not turn off the balancer during the bulk load.

Thanks for your continued help.

Eliot Horowitz

Apr 27, 2011, 2:10:55 PM
to mongod...@googlegroups.com
That message doesn't make sense alone.
Can you send an entire mongos log?

briank

Apr 27, 2011, 3:31:45 PM
to mongod...@googlegroups.com
The entire log is quite large - but I've included a much more significant chunk. If you need more - just let me know.

Here is a chunk from around the same time from one of the shard primaries:


Eliot Horowitz

Apr 27, 2011, 5:52:27 PM
to mongod...@googlegroups.com
And what errors are you seeing in the client?

briank

Apr 27, 2011, 6:35:29 PM
to mongod...@googlegroups.com
I don't have the exact error any more, I'm afraid. A MongoException was thrown while executing a Find, with a message approximately like the last line in my mongos log:
"ns: coredata.clips ClusteredCursor::query ShardConnection had to change". This happened 5 or 6 times over the span of a couple of hours, executing about 50k queries and around 800k updates to documents.

I catch this exception, but I'm not sure how I should be handling it. Do I just retry? Or is something more fundamentally wrong? It doesn't seem like an exception that should be thrown up to the application.
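
Not an official answer, but the usual workaround for a transient error like this is a small bounded retry around the query. A sketch in mongo shell JavaScript (the same idea would apply to a Find call in the C# driver; collection and field names are taken from the log above):

    // Sketch: retry a query a few times if mongos throws a transient
    // "ShardConnection had to change" style error, then give up.
    function findWithRetry(coll, query, maxAttempts) {
        for (var attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return coll.find(query).toArray();
            } catch (e) {
                if (attempt === maxAttempts) throw e;
                print("query failed (attempt " + attempt + "): " + e);
                sleep(100 * attempt);   // brief backoff before retrying
            }
        }
    }

    // Example usage:
    // findWithRetry(db.getSiblingDB("coredata").clips, { t: 17 }, 3);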

Eliot Horowitz

Apr 28, 2011, 1:45:35 AM
to mongod...@googlegroups.com
You shouldn't get that message, but there are some variants, so the
exact message would be helpful.
Also, what kind of operation was it on: a query, a command, etc.?

briank

Apr 28, 2011, 9:43:25 AM
to mongod...@googlegroups.com
It was during a query operation. I would basically query out a list of documents (list size between 10 and 1000), then update each item with a Set on a single property. Before I put exception handling around this call, it crashed my migration job - I dug out the Windows EventLog entry in case that can help. If you need the exact error message to continue diagnosing, I will be running another batch migration in a week or two, so I can update this post then.

Application: <removed>.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: MongoDB.Driver.MongoQueryException
Stack:
   at MongoDB.Driver.Internal.MongoCursorEnumerator`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].GetFirst()
   at MongoDB.Driver.Internal.MongoCursorEnumerator`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Collections.Generic.List`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]..ctor(System.Collections.Generic.IEnumerable`1<System.__Canon>)
   at System.Linq.Enumerable.ToList[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]](System.Collections.Generic.IEnumerable`1<System.__Canon>)
   at <removed>.Program.AddCutupOrders(System.String, Int64, Int64)
   at <removed>.Program.Main(System.String[])
