Sharding not balancing after database drop and recreate: "couldn't find index over splitting key"


Theo

Jul 6, 2011, 4:17:57 PM
to mongodb-user
I have a sharded database with sharded collections. Each collection is
sharded on {split_key: 1, _id: 1} (split_key is a hash of part of
the ID; the ID is monotonically increasing). Today I dropped the
database to clear out old data we didn't need, and recreated it
with the same name, the same collections, the same indexes and the same
sharding configuration (most of the sharding commands said that the
database and collections were already sharded -- seems like sharding
configuration is not removed when you drop a database).

Now when I insert data it is not balancing between the servers; most
of the data ends up on one of the machines. I assume that the
configuration or some other part of Mongo's internal bookkeeping got
corrupted when I dropped the database (just the fact that the sharding
configuration is not cleared out when a database is dropped makes me
really worried).

The mongos logs contain messages similar to this from time to time:

Wed Jul 6 20:03:15 [conn21] autosplitted fragments.exposure_fragments shard: ns:fragments.exposure_fragments at: rfmshard3:rfmshard3/rfmcolldb04:27017,rfmcolldb06:27017,rfmcolldb05:27017 lastmod: 12|5 min: { split_key: 0, _id: "LNWP8PBPJRQ0E74E000024" } max: { split_key: 11, _id: "MFHLF043BAPB0E9E000001" } on: { split_key: 0, _id: "LNWR0ET62V6CED01000000" } (splitThreshold 67108864)
Wed Jul 6 20:03:15 [conn21] warning: could have autosplit on collection: fragments.parsefail_fragments but: splitVector command failed: { errmsg: "couldn't find index over splitting key", ok: 0.0 }
Wed Jul 6 20:03:15 [conn27] ns: fragments.pageview_fragments ClusteredCursor::query ShardConnection had to change attempt: 0
Wed Jul 6 20:03:15 [conn21] warning: could have autosplit on collection: fragments.error_fragments but: splitVector command failed: { errmsg: "couldn't find index over splitting key", ok: 0.0 }

Looks to me like it's complaining about there being no index for the
shard key, but there is one:

> db.parsefail_fragments.getIndexes()
[
    {
        "name" : "_id_",
        "ns" : "fragments.parsefail_fragments",
        "key" : {
            "_id" : 1
        },
        "v" : 0
    },
    {
        "_id" : ObjectId("4e142304d0da24f265b6ce4d"),
        "ns" : "fragments.parsefail_fragments",
        "key" : {
            "split_key" : 1,
            "_id" : 1
        },
        "name" : "split_key_1__id_1",
        "v" : 0
    }
]

We run mongod 1.8.1 and mongos 1.8.2-rc2 (because 1.8.1 had way too
many sharding bugs -- yes, I'm sure we could upgrade to 1.8.2). We
have four shards with three replicas each, running on six quite big EC2
instances (the exact specs escape me).

T#

David Tollmyr

Jul 7, 2011, 8:26:19 AM
to mongodb-user
Hi. I'm David, Theo's colleague.

We've done some further digging and have discovered what seems to be an
inconsistency in the indexes.
All our collections report the correct indexes when we call getIndexes(),
like you see in Theo's post.
However, if we call db.printCollectionStats(), all but one collection
reports only the _id index.

> db.printCollectionStats()
exposure_fragments
{
"sharded" : true,
"ns" : "fragments.exposure_fragments",
"count" : 44364015,
"size" : 15476136064,
"avgObjSize" : 348.84435198211884,
"storageSize" : 19526777856,
"nindexes" : 2,
"nchunks" : 374,
"shards" : {
"rfmshard1" : {
"ns" : "fragments.exposure_fragments",
"count" : 119,
"size" : 42468,
"avgObjSize" : 356.87394957983196,
"storageSize" : 172032,
"numExtents" : 3,
"nindexes" : 2,
"lastExtentSize" : 131072,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 16384,
"indexSizes" : {
"_id_" : 8192,
"split_key_1__id_1" : 8192
},
"ok" : 1
},
"rfmshard3" : {
"ns" : "fragments.exposure_fragments",
"count" : 26476343,
"size" : 9256036616,
"avgObjSize" : 349.5964913281264,
"storageSize" : 12651963392,
"numExtents" : 35,
"nindexes" : 2,
"lastExtentSize" : 2115592192,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 4382312320,
"indexSizes" : {
"_id_" : 2459198400,
"split_key_1__id_1" : 1923113920
},
"ok" : 1
},
"rfmshard4" : {
"ns" : "fragments.exposure_fragments",
"count" : 17887553,
"size" : 6220056980,
"avgObjSize" : 347.73101608699636,
"storageSize" : 6874642432,
"numExtents" : 33,
"nindexes" : 2,
"lastExtentSize" : 1152296704,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 2951022464,
"indexSizes" : {
"_id_" : 1393304512,
"split_key_1__id_1" : 1557717952
},
"ok" : 1
}
},
"ok" : 1
}

pageview_fragments
{
"sharded" : true,
"ns" : "fragments.pageview_fragments",
"count" : 64577,
"size" : 14881900,
"avgObjSize" : 230.45201852052588,
"storageSize" : 3414381568,
"nindexes" : 1,
"nchunks" : 1,
"shards" : {
"rfmshard3" : {
"ns" : "fragments.pageview_fragments",
"count" : 64577,
"size" : 14881900,
"avgObjSize" : 230.45201852052588,
"storageSize" : 3414381568,
"numExtents" : 27,
"nindexes" : 1,
"lastExtentSize" : 578848512,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 5070848,
"indexSizes" : {
"_id_" : 5070848
},
"ok" : 1
}
},
"ok" : 1
}

> db.exposure_fragments.getIndexes()
[
    {
        "name" : "_id_",
        "ns" : "fragments.exposure_fragments",
        "key" : {
            "_id" : 1
        },
        "v" : 0
    },
    {
        "_id" : ObjectId("4e1422f8d0da24f265b6ce4a"),
        "ns" : "fragments.exposure_fragments",
        "key" : {
            "split_key" : 1,
            "_id" : 1
        },
        "name" : "split_key_1__id_1",
        "v" : 0
    }
]

The next step is to try rebuilding all the indexes and see if that helps.

/David

Greg Studer

Jul 7, 2011, 11:36:54 PM
to mongod...@googlegroups.com
> sharding configuration (most of the sharding commands said that the
> database and collections were already sharded -- seems like sharding
> configuration is not removed when you drop a database).

Think you're running into https://jira.mongodb.org/browse/SERVER-2253.
The sharding config is cleared at first, but the db entry is
re-inserted. The workaround is to do
config.databases.remove({ _id : <database> }) after you drop a sharded
db. The fix may be simple to backport; looking into this as well for
1.9.2.
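
For concreteness, a minimal shell sketch of this workaround, assuming the database in question is "fragments" and you are connected through a mongos (substitute your own database name):

// drop the sharded database through mongos
use fragments
db.dropDatabase()

// remove the stale entry that SERVER-2253 leaves behind in the config metadata
use config
db.databases.remove({ _id : "fragments" })

After that, re-enabling sharding on the recreated database and collections should start from a clean slate.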

> All our collections report correct indexes when we call getIndexes(),
> like you see in Theos post.
> However, if we call db.printCollectionStats() all but one collection
> only report the _id index.

getIndexes() pulls from the <database>.system.indexes metadata
collection, which seems to have stale data. printCollectionStats() is
reporting the true values from the collstats command run on the servers.
Explicitly dropping and re-creating the index should work.
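
A sketch of that check and rebuild, using the collection and shard key from this thread (note that, depending on version, mongos may refuse to drop the index backing the shard key, in which case reIndex() on the collection is the fallback):

// what the <db>.system.indexes metadata claims
db.parsefail_fragments.getIndexes()

// what the shards actually report (collstats; per-shard index sizes appear under "shards")
db.parsefail_fragments.stats()

// if the two disagree, drop and re-create the shard-key index explicitly
db.parsefail_fragments.dropIndex({ split_key : 1, _id : 1 })
db.parsefail_fragments.ensureIndex({ split_key : 1, _id : 1 })

// or rebuild every index on the collection in one go
db.parsefail_fragments.reIndex()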

David Tollmyr

Jul 8, 2011, 3:53:20 AM
to mongodb-user
Rebuilding the indexes helped somewhat. Now our collections are
sharded evenly on shards 3 and 4. The "couldn't find index over
splitting key" errors are gone.
But for some reason shards 1 and 2 are hardly being utilized at all.
All servers are up and responding properly.

Ideas?

--- Sharding Status ---
sharding version: { "_id" : 1, "version" : 3 }
shards:
    { "_id" : "rfmshard1", "host" : "rfmshard1/rfmcolldb03:27017,rfmcolldb02:27017,rfmcolldb01:27017" }
    { "_id" : "rfmshard2", "host" : "rfmshard2/rfmcolldb01:27117,rfmcolldb03:27117,rfmcolldb02:27117" }
    { "_id" : "rfmshard3", "host" : "rfmshard3/rfmcolldb04:27017,rfmcolldb06:27017,rfmcolldb05:27017" }
    { "_id" : "rfmshard4", "host" : "rfmshard4/rfmcolldb04:27117,rfmcolldb06:27117,rfmcolldb05:27117" }

{ "_id" : "complete", "partitioned" : true, "primary" : "rfmshard4" }
    complete.exposures chunks:
        rfmshard3  162
        rfmshard4  396
        rfmshard2  396
        too many chunks to print, use verbose if you want to force print
    complete.pageviews chunks:
        rfmshard3  1
        rfmshard4  18
        rfmshard2  1
        too many chunks to print, use verbose if you want to force print

    fragments.exposure_fragments chunks:
        rfmshard4  344
        rfmshard3  344
        rfmshard1  1



Greg Studer

Jul 8, 2011, 9:41:49 AM
to mongod...@googlegroups.com
Something may be interfering with balancing / migrations - you can check
the status of balancing by grepping the mongos logs for "[balancer]", and
migrations by grepping for "migrate". You can also turn on more verbose
logging temporarily with adminCommand( { "setParameter" : 1 , logLevel : 1 } ).

Another thing to try - does config.locks.find({ _id : "balancer",
state : 1 }) return the same document when you run it twice 20 mins
apart?
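
A minimal sketch of those checks from a mongos shell (the 20-minute gap is the point: if the ts and when fields haven't moved between the two runs, the lock is stale):

// temporarily bump mongos log verbosity (set it back to 0 when done)
db.adminCommand({ setParameter : 1, logLevel : 1 })

// run this twice, ~20 minutes apart, and compare ts/when
use config
db.locks.find({ _id : "balancer", state : 1 })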

David Tollmyr

Jul 8, 2011, 11:48:10 AM
to mongodb-user
We're seeing a few "dist_lock error trying to aquire lock" errors in
the logs.

And this balancer lock seems to have been set for a long time:
{ "_id" : "balancer", "process" : "rfmstaging01.byburt.com:
1305191764:1804289383", "state" : 1, "ts" :
ObjectId("4e16c7346c1633164bdf40ce"), "when" :
ISODate("2011-07-08T09:00:36.992Z"), "who" : "rfmstaging01.byburt.com:
1305191764:1804289383:Balancer:846930886", "why" : "doing balance
round" }
If the when timestamp is correct that's more than 8 hours ago.

Greg Studer

Jul 8, 2011, 1:19:57 PM
to mongod...@googlegroups.com
The balancer lock needs to be forced manually - I'm guessing your
connectivity was interrupted during several unlock retries, but the
process is still active and pinging, which requires a manual force (in
1.8).

You'll want to double-check that the lock ts value hasn't changed, and
then do a config.locks.update({ _id : "balancer", ts : <ts> },
{ $set : { state : 0 } }).
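
Sketched with the ts from David's earlier output - double-check the ObjectId against your own config.locks before running; filtering on ts keeps the update from clobbering a lock that has since been re-taken:

use config
// confirm the lock is still the one you saw
db.locks.find({ _id : "balancer" })
// then force it open
db.locks.update({ _id : "balancer", ts : ObjectId("4e16c7346c1633164bdf40ce") },
                { $set : { state : 0 } })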

David Tollmyr

Jul 11, 2011, 8:44:17 AM
to mongodb-user
Released the balancer lock this morning. I now see new balancer tasks
show up regularly, but I'm not convinced it's doing anything. It's been
four hours and I see no apparent change in the sharding ratios. I
still see numbers like these:
complete_backup.exposures chunks:
rfmshard3 488
rfmshard4 488

complete.exposures chunks:
rfmshard3 358
rfmshard4 888
rfmshard2 887

fragments.exposure_fragments chunks:
rfmshard4 765
rfmshard3 758
rfmshard1 1

I also see messages like this show up in the mongos logs:
Mon Jul 11 08:07:12 [conn38] warning: splitChunk failed - cmd: { splitChunk: "fragments.exposure_fragments", keyPattern: { split_key: 1, _id: 1 }, min: { split_key: 11, _id: "LO5PEHZZ7EFNA23C000001" }, max: { split_key: 11, _id: "MFHLF043BAPB0E9E000001" }, from: "rfmshard3/rfmcolldb04:27017,rfmcolldb06:27017,rfmcolldb05:27017", splitKeys: [ { split_key: 11, _id: "LO5R3OXXQNVN956A000000" } ], shardId: "fragments.exposure_fragments-split_key_11_id_"LO5PEHZZ7EFNA23C000001"", configdb: "rfmcolldb01:28100,rfmcolldb02:28100,rfmcolldb03:28100" } result: { who: { _id: "fragments.exposure_fragments", process: "rfmcolldb05.byburt.com:1309270458:2098825282", state: 1, ts: ObjectId('4e1aaddb65bb20739ffdb605'), when: new Date(1310371291616), who: "rfmcolldb05.byburt.com:1309270458:2098825282:conn1614:1797913900", why: "migrate-{ split_key: 5, _id: "LO05R9T61FC5C1DF000001" }" }, errmsg: "the collection's metadata lock is taken", ok: 0.0 }

Ideas?

Greg Studer

Jul 11, 2011, 1:19:26 PM
to mongod...@googlegroups.com
Yeah, think you've got another stale lock - the collection metadata
lock.

Can you send config.locks.find({ state : 1 }) - looking for all active
locks that don't change in ~20 mins.

David Tollmyr

Jul 11, 2011, 2:13:49 PM
to mongodb-user
Sadly, the only active lock seems to be the balancer, and it updates
often so it's not stale.

Any other ideas on where I could look? Nothing pops out anymore.

None of the databases seem to get properly distributed.
And this is quite weird:
{ "_id" : "fragments", "partitioned" : true, "primary" : "rfmshard2" }

Shard 2 is primary, but there is absolutely no utilization of shard2 for
that db.

.d

Greg Studer

Jul 11, 2011, 2:44:09 PM
to mongod...@googlegroups.com
That's strange, somehow the mongos is finding the lock document { _id: "fragments.exposure_fragments", process: "rfmcolldb05.byburt.com:1309270458:2098825282", state: 1, ts: ObjectId('4e1aaddb65bb20739ffdb605') }

The lock data on different config servers may be inconsistent - can you
log in individually to each and check config.locks.find()?

What do you mean by absolutely no utilization? Just that there are no
collection chunks on that shard for that db?
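
A sketch of that per-config-server check, using the config server addresses from the splitChunk log line above (the point is that all three copies of config.locks should agree):

// connect to each config server directly, e.g.
//   mongo rfmcolldb01:28100/config
//   mongo rfmcolldb02:28100/config
//   mongo rfmcolldb03:28100/config
// and on each one compare the active locks
db.locks.find({ state : 1 })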

David Tollmyr

Jul 11, 2011, 3:34:17 PM
to mongod...@googlegroups.com
Lock data seems to be consistent on all config servers. :/
No "fragments" collection chunks on shard2, despite that shard being the primary one.

.d



Greg Studer

Jul 11, 2011, 4:13:44 PM
to mongod...@googlegroups.com
Just to confirm - you're at this moment getting repeated splitChunk ...
errmsg: "the collection's metadata lock is taken", ok: 0.0 } messages in
your mongos log, which include the competing lock data, but config.locks
is empty on all three config servers? Was this an older message, and
are there newer error messages now blocking balancing?

Also, is at least some balancer still running in a mongos process?

Primary refers to the database, not to the collection; it's the shard on
which new (unsharded) collections will be created, so I don't think this
is necessarily an issue. You can test this by creating a new collection
for that db.
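
A sketch of that test - the collection name here is made up for illustration; an unsharded collection created through mongos should end up on the database's primary shard (rfmshard2 in this case):

use fragments
// hypothetical throwaway collection, just to see where it gets created
db.balance_probe.insert({ probe : 1 })
// for an unsharded collection this is served by the database's primary shard
db.balance_probe.stats()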


David Tollmyr

Jul 11, 2011, 5:19:01 PM
to mongod...@googlegroups.com
The "collection metadata lock" messages are not as common as i thought. I see about 6-8 in the past 12 hours.
As far as balancer goes i see this regularly but not much else:
Mon Jul 11 20:29:54 [Balancer] SyncClusterConnection connecting to [rfmcolldb03:28100]

I see a lot of these if they're relevant:
Mon Jul 11 21:14:52 [conn35] ns: fragments.pageview_fragments ClusteredCursor::query ShardConnection had to change attempt: 0

Greg Studer

Jul 11, 2011, 11:07:13 PM
to mongod...@googlegroups.com
Okay, makes sense, think that's not the cause of the current balancer
issue. If you turn on verbose logging, there should be more messages
telling you why the balancer has decided not to do anything - the
particular mongos whose balancer is active is in the balancer lock
document.

Guessing the "had to change attempt:0" indicates some connections
dropped/re-established at some point which may have made the situation
more confusing.


Greg Studer

Jul 11, 2011, 11:12:47 PM
to mongod...@googlegroups.com
Sorry, it's late - not dropped/re-established but refreshing sharding
data.


David Tollmyr

Jul 12, 2011, 3:47:24 AM
to mongod...@googlegroups.com
This is what I keep seeing if I set logLevel to 5 on the server running the balancer.

Tue Jul 12 07:36:20 [Balancer] dist_lock lock gotLock: 1 now: { _id: "balancer", process: "rfmcrunchclu02.byburt.com:1306337587:1804289383", state: 1, ts: ObjectId('4e1bf97445b5c0ccc283f9d5'), when: new Date(1310456180318), who: "rfmcrunchclu02.byburt.com:1306337587:1804289383:Balancer:846930886", why: "doing balance round" }
Tue Jul 12 07:36:20 [Balancer] *** start balancing round
Tue Jul 12 07:36:20 [Balancer] collection : fragments.exposure_fragments
Tue Jul 12 07:36:20 [Balancer] donor      : 892 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] receiver   : 892 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : fragments.pageview_fragments
Tue Jul 12 07:36:20 [Balancer] donor      : 11 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] receiver   : 11 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : complete.exposures
Tue Jul 12 07:36:20 [Balancer] donor      : 1015 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] receiver   : 1015 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : complete.pageviews
Tue Jul 12 07:36:20 [Balancer] donor      : 45 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] receiver   : 45 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : fragments.error_fragments
Tue Jul 12 07:36:20 [Balancer] donor      : 15 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] receiver   : 15 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : fragments.malformed_fragments
Tue Jul 12 07:36:20 [Balancer] donor      : 1 chunks on rfmshard3
Tue Jul 12 07:36:20 [Balancer] receiver   : 0 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : fragments.parsefail_fragments
Tue Jul 12 07:36:20 [Balancer] donor      : 5 chunks on rfmshard3
Tue Jul 12 07:36:20 [Balancer] receiver   : 5 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : fragments.visit_fragments
Tue Jul 12 07:36:20 [Balancer] donor      : 23 chunks on rfmshard3
Tue Jul 12 07:36:20 [Balancer] receiver   : 23 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : state.assembler_state
Tue Jul 12 07:36:20 [Balancer] donor      : 7 chunks on rfmshard3
Tue Jul 12 07:36:20 [Balancer] receiver   : 6 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : complete_backup.exposures
Tue Jul 12 07:36:20 [Balancer] donor      : 488 chunks on rfmshard3
Tue Jul 12 07:36:20 [Balancer] receiver   : 488 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] collection : complete_backup.pageviews
Tue Jul 12 07:36:20 [Balancer] donor      : 16 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] receiver   : 16 chunks on rfmshard4
Tue Jul 12 07:36:20 [Balancer] no need to move any chunk
Tue Jul 12 07:36:20 [Balancer] *** end of balancing round
Tue Jul 12 07:36:24 [LockPinger] dist_lock pinged successfully for: rfmcrunchclu02.byburt.com:1306337587:1804289383
Tue Jul 12 07:36:29 [ReplicaSetMonitorWatcher] checking replica set: rfmshard1
Tue Jul 12 07:36:29 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: rfmcolldb02:27017 { setName: "rfmshard1", ismaster: true, secondary: false, hosts: [ "rfmcolldb02:27017", "rfmcolldb03:27017", "rfmcolldb01:27017" ], maxBsonObjectSize: 16777216, ok: 1.0 }
Tue Jul 12 07:36:29 [ReplicaSetMonitorWatcher] checking replica set: rfmshard2
Tue Jul 12 07:36:29 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: rfmcolldb03:27117 { setName: "rfmshard2", ismaster: true, secondary: false, hosts: [ "rfmcolldb03:27117", "rfmcolldb02:27117", "rfmcolldb01:27117" ], maxBsonObjectSize: 16777216, ok: 1.0 }
Tue Jul 12 07:36:29 [ReplicaSetMonitorWatcher] checking replica set: rfmshard3
Tue Jul 12 07:36:29 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: rfmcolldb05:27017 { setName: "rfmshard3", ismaster: true, secondary: false, hosts: [ "rfmcolldb05:27017", "rfmcolldb06:27017", "rfmcolldb04:27017" ], maxBsonObjectSize: 16777216, ok: 1.0 }
Tue Jul 12 07:36:29 [ReplicaSetMonitorWatcher] checking replica set: rfmshard4
Tue Jul 12 07:36:29 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: rfmcolldb06:27117 { setName: "rfmshard4", ismaster: true, secondary: false, hosts: [ "rfmcolldb06:27117", "rfmcolldb05:27117", "rfmcolldb04:27117" ], maxBsonObjectSize: 16777216, ok: 1.0 }
Tue Jul 12 07:36:30 [Balancer] dist_lock unlock: { _id: "balancer", process: "rfmcrunchclu02.byburt.com:1306337587:1804289383", state: 0, ts: ObjectId('4e1bf97445b5c0ccc283f9d5'), when: new Date(1310456180318), who: "rfmcrunchclu02.byburt.com:1306337587:1804289383:Balancer:846930886", why: "doing balance round" }

.d

Greg Studer

Jul 12, 2011, 10:33:14 AM
to mongod...@googlegroups.com
shard2/shard1 may have ops queued, or have hit a maximum shard size - you can
check if there are writebacks queued using the serverStatus() command.
I assume you haven't set a maxSize for the shard, but you can
double-check this by looking at config.shards for 'maxSize' fields. Can
you send the config.shards.find() output in any case?

If stale writebacks are queued, you'll need to bounce the shard
primaries to remove them.
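
A sketch of both checks (writeBacksQueued is the serverStatus field in question; run it against each shard primary, and run the config.shards query through a mongos):

// on each shard primary: are there writebacks queued?
db.serverStatus().writeBacksQueued

// on a mongos: has any shard been given a maxSize cap?
use config
db.shards.find()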

David Tollmyr

Jul 13, 2011, 2:37:11 AM
to mongod...@googlegroups.com
Shards 1, 2 and 3 all have writeBacksQueued: true.

Shard config:
{ "_id" : "rfmshard1", "host" : "rfmshard1/rfmcolldb03:27017,rfmcolldb02:27017,rfmcolldb01:27017" }
{ "_id" : "rfmshard2", "host" : "rfmshard2/rfmcolldb01:27117,rfmcolldb03:27117,rfmcolldb02:27117" }
{ "_id" : "rfmshard3", "host" : "rfmshard3/rfmcolldb04:27017,rfmcolldb06:27017,rfmcolldb05:27017" }
{ "_id" : "rfmshard4", "host" : "rfmshard4/rfmcolldb04:27117,rfmcolldb06:27117,rfmcolldb05:27117" }

How would I best go about bouncing the shard primaries? replSetStepDown?
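
For reference, a minimal stepDown sketch, run against the current primary of each shard's replica set (not through mongos); 60 is just an illustrative number of seconds the node stays ineligible for re-election:

// connect directly to the shard's current primary, then:
rs.stepDown(60)
// the shell may report a dropped connection afterwards - that's expected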

.d

David Tollmyr

Jul 13, 2011, 9:00:28 AM
to mongod...@googlegroups.com
I did stepDowns on shard1 and 2 today, and it seems to have worked. I see the balancer moving data now and we're slowly moving towards a more even distribution.
Once I see the numbers get a bit more even, I'll know if I need to bounce shard3 as well.

Thanks a lot Greg for the help!

/David - Burt