pre-splitting and moveChunk on empty database

Peter Uhrig

Aug 12, 2011, 8:45:36 PM
to mongodb-user
Dear all,

I tried to do some pre-splitting in order to speed up insertion (and
it really helped a lot!). So I decided to split the sharded collection
into 4,096 chunks (which is a lot for my initial import but I'm
relatively sure I will need them eventually).
To do that, I used a modified version of the Perl script found here
http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
to create a .js file, the top of which looks like this:

admin = db.getSisterDB('admin');
admin.db.runCommand({split: "treebank.satz", middle: { _id: "100000000000000000000000000000" } });
admin.db.runCommand({moveChunk: "treebank.satz", find: { _id: "100000000000000000000000000000" }, to: "shard0000"});
admin.db.runCommand({split: "treebank.satz", middle: { _id: "200000000000000000000000000000" } });
admin.db.runCommand({moveChunk: "treebank.satz", find: { _id: "200000000000000000000000000000" }, to: "shard0001"});
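
For readers who want to generate such a file without Perl, here is a minimal sketch in plain JavaScript (run with Node, not the mongo shell). It is not the script from the blog post. The function name `presplitCommands` and its parameters are hypothetical, and it assumes that `_id` values are fixed-width hex strings and that chunks are handed out to the shards round-robin. It emits `admin.runCommand(...)`, the usual way to run a command on a database handle, rather than the `admin.db.runCommand(...)` spelling used in the post.

```javascript
// Hypothetical helper: builds the lines of a pre-splitting .js file.
// Assumption: _id values are fixed-width lowercase hex strings.
function presplitCommands(ns, hexDigits, idLength, shards) {
  const lines = ["admin = db.getSisterDB('admin');"];
  const total = Math.pow(16, hexDigits);
  // Start at 1: the range below the first split point is the initial chunk.
  for (let i = 1; i < total; i++) {
    const prefix = i.toString(16).padStart(hexDigits, "0");
    const splitPoint = prefix.padEnd(idLength, "0");
    // Assumption: round-robin placement over the listed shards.
    const shard = shards[(i - 1) % shards.length];
    lines.push(`admin.runCommand({split: "${ns}", middle: {_id: "${splitPoint}"}});`);
    lines.push(`admin.runCommand({moveChunk: "${ns}", find: {_id: "${splitPoint}"}, to: "${shard}"});`);
  }
  return lines;
}

// One hex digit gives 15 split points, i.e. 16 chunks.
const commands = presplitCommands("treebank.satz", 1, 30, ["shard0000", "shard0001"]);
console.log(commands.join("\n"));
```

Passing 3 for the hypothetical `hexDigits` argument would give 4,095 split points, i.e. 4,096 chunks; whether those boundaries suit a given collection depends on how its `_id` values are actually distributed.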

However, the whole process takes about 75 minutes on 4 pretty fast
machines with an incredibly fast (InfiniBand) network, on a completely
empty database!
That is not a lot of time compared to what it saves me on the inserts,
so it does not bother me all that much. I was just wondering why
moving empty chunks from one completely bored machine to another takes
so long. (If I naively compare it to moving empty files, it should not
take more than a few seconds.)

Or are 4,096 chunks too many, and should I go with 256 instead and
let MongoDB auto-split when necessary?

Thanks a lot,
Peter

Mathias Stearn

Aug 15, 2011, 4:08:16 PM
to mongodb-user
I think I found the issue. We sleep for one second on each moveChunk
to wait for completion, since we poll for doneness. With 4,096 chunks
that is 4,096 seconds (about 68 minutes) of sleeping, which is close
to the 75 minutes (4,500 seconds) you saw, so that is where the vast
majority of the time is spent.
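
The arithmetic can be checked with a short sketch in plain JavaScript. It uses only the figures from this thread and assumes exactly one moveChunk per chunk and exactly one second of sleep per moveChunk; the variable names are illustrative.

```javascript
// Back-of-the-envelope check using only the numbers quoted in this thread.
const chunks = 4096;           // assumed: one moveChunk per chunk
const sleepPerMoveSec = 1;     // fixed wait per moveChunk
const observedSec = 75 * 60;   // "about 75 minutes"

const sleepSec = chunks * sleepPerMoveSec;   // time spent sleeping
const remainderSec = observedSec - sleepSec; // time left for actual work
const sleepFor256Sec = 256 * sleepPerMoveSec; // same estimate for 256 chunks

console.log("sleeping:", sleepSec, "s =", (sleepSec / 60).toFixed(1), "min");
console.log("remaining:", remainderSec, "s");
console.log("with 256 chunks:", sleepFor256Sec, "s of sleeping");
```

Under these assumptions the sleeps alone account for 4,096 of the roughly 4,500 seconds, leaving about 400 seconds for everything else, and the same estimate for 256 chunks is a little over four minutes.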

I created a jira to try to fix this: https://jira.mongodb.org/browse/SERVER-3602

In the meantime, if you need to presplit again, you can directly
modify the config.chunks collection if you are careful. I wouldn't
recommend messing with it if you already have data in there, but there
is no risk if you don't. Just be sure to make the lastMod field unique
for all chunks (this may not be possible from JavaScript).

Peter Uhrig

Aug 16, 2011, 9:15:28 AM
to mongodb-user
Thanks a lot! I have decided, at least for my tests, to stick with 256
chunks for now.