Test Dataset for MongoDB


Abhi

Apr 21, 2014, 9:03:32 AM
to mongod...@googlegroups.com
Hi,
I am looking for test datasets that can be imported into MongoDB for testing purposes. I have looked at this dataset, http://media.mongodb.org/zips.json, but I need one with millions of documents.
The dataset can be on any theme: employee details, customer data, sales data, e-commerce data, orders and supplies, etc.

If there is no such dataset available, how can I generate one? Are there any open-source tools to generate test data for MongoDB?

Thanks,
Abhi

Sandip Chatterjee

Apr 21, 2014, 11:21:33 AM
to mongod...@googlegroups.com
What do you want to use this test dataset for?  You could always generate a large dataset programmatically using random numbers, etc.

The easiest way I can think of is to take a dataset like MovieLens ( http://grouplens.org/datasets/movielens/ ), reformat it into JSON, and then run mongoimport to efficiently get it into MongoDB.
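
For example, a rough sketch assuming the MovieLens 1M ratings.dat file (one UserID::MovieID::Rating::Timestamp record per line; the field, database, and collection names below are just made up):

# emit one JSON document per line, then bulk-load with mongoimport
awk -F'::' '{print "{\"user\": "$1", \"movie\": "$2", \"rating\": "$3", \"ts\": "$4"}"}' ratings.dat > ratings.json
mongoimport -d test -c ratings --file ratings.json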

Asya Kamsky

Apr 21, 2014, 7:26:32 PM
to mongodb-user
You can just google JSON large dataset - or generate your own.

One example I've used is the Enron DB (of email messages).
There is also http://labrosa.ee.columbia.edu/millionsong/lastfm which is songs with tags.

You can also capture Twitter through their streaming API - all the tweets arrive as JSON, which you can then import into your MongoDB.

Asya




Abhi

Apr 23, 2014, 10:13:07 AM
to mongod...@googlegroups.com
Hi,
Thanks for the reply, Asya and Sandip.

I looked at the datasets and decided to generate my own test data, as the available datasets did not satisfy the relationships that I require in my documents.

I want the generated dataset to reach a database size of 1 TB. I will be generating JSON documents with random values, and I am planning to use server-side JavaScript to do this. I am not sure which approach is better for populating the database:

1. Generate each JSON document and insert it individually.
2. Generate a batch of documents and use a batch insert. If the latter, what is the ideal batch size?

Asya, Sandip: how did you populate your test databases? What is the recommended approach for this task? Any suggestions, or pitfalls to avoid while doing this?


Thanks,
Abhi

Sandip Chatterjee

Apr 23, 2014, 3:07:20 PM
to mongod...@googlegroups.com
Hi Abhi,

In my (limited) experience with large MongoDB databases...

- I've used a large JSON file as input for the mongoimport utility (included with MongoDB). The file contained one document per line, with no separating commas (by default, mongoimport expects newline-delimited JSON rather than a JSON array):
{"key1" : "value1", "key2" : "value2", ...}
{"key1" : "value3", "key2" : "value4", ...}
...

and I ran mongoimport using:
mongoimport -d myDatabaseName -c myCollectionName --file myLargeJSONfile.json

Using this approach, I was getting insert performance of ~23k records per second on my machine (regular SATA hard disk, 96GB RAM, openSUSE Linux, MongoDB 2.4.3).  This was with the JSON file located on the same machine and same disk as the dbpath, so no network traffic/overhead involved.

- The "_id" field (unique document identifier) index is automatically generated and updated with each new document inserted into the collection, but I've found that having any additional indexes significantly slows down insert performance.  If your goal is to input all your data into MongoDB as fast as possible (on a single mongod instance), I would recommend bulk inserting with something like mongoimport, followed by index generation with db.collection.ensureIndex -- this way the index isn't updated with each new document added to the database

- If you're interested in speeding things up a little more, I've tried running multiple mongoimport processes in parallel with some success. Because of database-level write locking, beyond a certain number of mongoimport (or other write) processes you will just have a lot of documents queued up waiting for locks to be released, which ends up slowing things down. The easiest way I've found to do this is with the GNU Parallel utility ( http://www.gnu.org/software/parallel/ ). Running two mongoimport processes in parallel (on the same machine) gives me a combined import rate of around 30k records/second. Not a huge speedup, but it could save some time if you're importing a lot of data.
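
Roughly, it can look like this (a sketch; pre-splitting the input file with split is just one way to produce several input files):

# split the big file into 1M-line chunks, then run one mongoimport per chunk in parallel
split -l 1000000 myLargeJSONfile.json part_
parallel mongoimport -d myDatabaseName -c myCollectionName --file {} ::: part_*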

- This is all assuming you are using a single machine (single mongod instance), and not a sharded cluster.  I don't have much experience with sharding, but this looks like a good starting point for getting data into MongoDB (alongside the official sharding tutorial, of course): 

Hope this helps!

Sandip

Abhi

Apr 24, 2014, 4:01:28 AM
to mongod...@googlegroups.com
Thanks for the suggestions, Sandip. How did you measure the insert performance while using mongoimport? Using mongostat, or some other utility?

I don't already have a dataset, so I was thinking of using server-side JavaScript (via the mongo shell) to generate JSON documents and insert them into MongoDB. For this, should I use batch inserts or single-document inserts? Will there be any difference in performance?

Thanks,
Abhi

Sandip Chatterjee

Apr 24, 2014, 12:42:50 PM
to mongod...@googlegroups.com
Mongoimport prints the import rate every few seconds to the screen (STDOUT).
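
If you want an independent measurement, the mongostat utility (also bundled with MongoDB) reports inserts per second -- for example, sampling every 5 seconds:

mongostat 5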

I don't have any experience generating or importing records through the JavaScript mongo shell, but I suspect that batch inserts will be more efficient.

Sandip

Asya Kamsky

Apr 24, 2014, 5:11:34 PM
to mongodb-user
I would recommend NOT using JSON files or server-side JavaScript.

I would recommend generating documents and inserting them directly in your code, using many threads or many clients to do this in parallel, since data generation can become the bottleneck rather than the actual writing.

Here's one example from the shell (using 2.6):

for (i=0;i<1000;i++) { docs.push({x: Math.random(), y:"abcxyz", foo:["bar",{baz:1}]}) }; null; db.new.insert(docs);
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 1000,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]
})

If I were doing this inside a loop *AND* running many threads/processes doing the same, then I would be sending batches of 1000 documents in parallel. In fact, you should probably look at the new unordered bulk API here:

var b = db.items.initializeUnorderedBulkOp();
for (i = 1; i < 10001; i++) {
    b.insert({_id: i, x: Math.random(), y: "abcxyz", foo: ["bar", {baz: 1}]});
    if (i % 1000 === 0) {  // flush each full batch of 1000 and start a new one
        b.execute();
        b = db.items.initializeUnorderedBulkOp();
    }
}

You can send batches, but the important thing is to parallelize the generation of the batches and send them in parallel as much as possible (this is why I don't recommend using server-side anything - that will be the limiting factor). The same goes for trying to load a giant JSON file that you previously generated.
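
One way to get that parallelism (a sketch only -- "genload.js" is a hypothetical script containing a bulk-insert loop like the one above; each client would need its own non-overlapping _id range, or should omit _id entirely):

# launch four mongo shell clients in parallel, each generating and inserting its own batches
for n in 1 2 3 4; do
    mongo mydb genload.js &
done
wait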

Asya



Sandip Chatterjee

Apr 24, 2014, 8:54:06 PM
to mongod...@googlegroups.com
Forgot about the new unordered bulk feature! I haven't tested it out myself, but it seems promising.

Abhi

Apr 25, 2014, 3:55:33 AM
to mongod...@googlegroups.com
Thanks, Asya, for your insights.

This is what I am planning to do now:
1. Generate documents and insert them directly in my code (JavaScript run client-side via the mongo shell).
2. Run multiple instances of such clients.

Please tell me if I am missing anything.

Thanks,
Abhi

Sebastián Estévez

May 27, 2014, 11:38:05 PM
to mongod...@googlegroups.com
Thanks Asya,

This is very helpful. We'll be giving it a try!

--Sebastián