Dear all,
I tried some pre-splitting in order to speed up insertion (and
it really helped a lot!). So I decided to split the sharded collection
into 4,096 chunks (which is a lot for my initial import, but I'm
fairly sure I will need them eventually).
So I used a modified version of the Perl script found here
http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
to create a .js file, the top of which looks like this:
admin = db.getSisterDB('admin');
admin.runCommand({split: "treebank.satz", middle: { _id: "100000000000000000000000000000" } });
admin.runCommand({moveChunk: "treebank.satz", find: { _id: "100000000000000000000000000000" }, to: "shard0000"});
admin.runCommand({split: "treebank.satz", middle: { _id: "200000000000000000000000000000" } });
admin.runCommand({moveChunk: "treebank.satz", find: { _id: "200000000000000000000000000000" }, to: "shard0001"});
However, the whole process takes about 75 minutes on four pretty fast
machines with an incredibly fast (InfiniBand) network, on a completely
empty database!
That is not a lot of time compared to what it saves me on the
inserts, so it does not bother me all that much; I was just wondering
why moving empty chunks from one completely bored machine to another
takes so long. (If I naively compare it to moving empty files, it
should not take more than a few seconds.)
Or are 4,096 chunks a bit too many? Should I go with 256 instead and
let MongoDB auto-split when necessary?
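In case it matters, for the 256-chunk variant I would probably generate
the commands directly in the shell instead of via the Perl script. This
is only a rough sketch: I'm assuming my _id values are uniformly
distributed 30-digit decimal strings like the ones above, and that the
four shards are named shard0000 through shard0003 (adjust for your setup):

// Sketch: pre-split into 256 chunks and spread them round-robin.
// Assumes uniformly distributed 30-digit decimal string _ids and
// four shards named shard0000..shard0003.
admin = db.getSisterDB('admin');
var shards = ["shard0000", "shard0001", "shard0002", "shard0003"];
var numChunks = 256;
for (var i = 1; i < numChunks; i++) {
    // 3-digit prefix spread evenly over 000..999, padded to 30 digits
    var prefix = ("000" + Math.floor(i * 1000 / numChunks)).slice(-3);
    var key = prefix + new Array(28).join("0");
    admin.runCommand({split: "treebank.satz", middle: { _id: key }});
    admin.runCommand({moveChunk: "treebank.satz", find: { _id: key },
                      to: shards[i % shards.length]});
}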
Thanks a lot,
Peter