Hi Abhi,
In my (limited) experience with large MongoDB databases...
- I've used a large JSON file as input for the mongoimport utility (included with MongoDB). The format of my JSON file was:
{"key1" : "value1", "key2" : "value2", ...},
{"key1" : "value3", "key2" : "value4", ...},
...
and I ran mongoimport using:
mongoimport -d myDatabaseName -c myCollectionName --file myLargeJSONfile.json
Using this approach, I was getting insert performance of ~23k records per second on my machine (regular SATA hard disk, 96GB RAM, openSUSE Linux, MongoDB 2.4.3). This was with the JSON file located on the same machine and same disk as the dbpath, so no network traffic/overhead involved.
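As a side note, mongoimport's default mode reads one complete JSON document per line. A minimal Python sketch (file and field names are illustrative, not from my actual setup) for generating a file in that shape and sanity-checking that every line parses before kicking off a long import:

```python
import json

# Write a few documents in the newline-delimited format mongoimport
# reads by default: one complete JSON document per line.
docs = [
    {"key1": "value1", "key2": "value2"},
    {"key1": "value3", "key2": "value4"},
]
with open("sample.json", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Sanity check before a long import: every line must parse on its own.
with open("sample.json") as f:
    parsed = [json.loads(line) for line in f]
print(len(parsed))  # prints 2
```

A malformed line caught here is much cheaper than one discovered hours into an import.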
- The "_id" field (the unique document identifier) is indexed automatically, and that index is updated with each new document inserted into the collection. I've found that having any additional indexes significantly slows down insert performance. If your goal is to get all your data into MongoDB as fast as possible (on a single mongod instance), I'd recommend bulk inserting with something like mongoimport first, followed by index generation with db.collection.ensureIndex() -- that way the indexes aren't updated with each new document added to the database.
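Concretely, in the mongo shell (database, collection, and field names are illustrative; ensureIndex is the 2.4-era call, later releases rename it createIndex):

```javascript
// Run after mongoimport has finished, so the index is built once
// rather than updated on every single insert:
use myDatabaseName
db.myCollectionName.ensureIndex({ key1 : 1 })
```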
- If you're interested in speeding things up a bit more, I've had some success running multiple mongoimport processes in parallel. Because of database-level write locking during import, beyond a certain number of mongoimport (or other write) processes you'll just have a lot of documents queued up waiting for locks to be released, which ends up slowing things down. The easiest way I've found to run them is with the GNU Parallel utility ( http://www.gnu.org/software/parallel/ ). Two mongoimport processes in parallel (on the same machine) give me a combined import rate of around 30k records/second. Not a huge speedup, but it could save some time if you're importing a lot of data.
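To feed the parallel mongoimport processes, each one needs its own input file. A hypothetical helper (split_for_import is my name, not an existing tool) that chops a newline-delimited JSON file into one chunk per process:

```python
# Hypothetical helper: split a newline-delimited JSON file into
# n_chunks chunk files, one per parallel mongoimport process.
def split_for_import(path, n_chunks):
    with open(path) as f:
        lines = f.readlines()
    chunk_paths = []
    for i in range(n_chunks):
        chunk_path = "%s.part%d" % (path, i)
        with open(chunk_path, "w") as out:
            out.writelines(lines[i::n_chunks])  # deal documents round-robin
        chunk_paths.append(chunk_path)
    return chunk_paths
```

Then the chunks can be imported with something along the lines of:
ls myLargeJSONfile.json.part* | parallel mongoimport -d myDatabaseName -c myCollectionName --file {}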
- This is all assuming you are using a single machine (single mongod instance), and not a sharded cluster. I don't have much experience with sharding, but this looks like a good starting point for getting data into MongoDB (alongside the official sharding tutorial, of course):
Hope this helps!
Sandip