speed up index generation?

Sandip Chatterjee

Apr 23, 2014, 4:04:43 PM
to mongod...@googlegroups.com
Hi,

I have a large collection (approaching 3 billion records) on a single Linux machine with a regular spinning hard disk drive as my dbpath location.  The documents in the database are small (average size 286 bytes).

After importing all documents into the collection, and with no other query/write activity occurring simultaneously, index generation on an integer field in this database takes 6 to 7 days to complete.
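
For concreteness, the build is just a plain foreground ensureIndex on that field, along the lines of the following (the collection and field names here are placeholders, not our real schema):

    // "records" and "intField" are illustrative names only.
    // Foreground build: blocks the database for the duration, but is
    // the faster of the two build modes.
    db.records.ensureIndex({ intField: 1 })

    // A background build keeps the database available while it runs,
    // but is typically even slower on a collection this large:
    db.records.ensureIndex({ intField: 1 }, { background: true })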

The machine has 16 cores with 96GB RAM, and is running openSUSE 11.3.  The database size is around 750GB without indexes.

Is there any way to speed up this process?  Would a sharded setup help in this case?  Or SSD storage?  Monitoring CPU activity, a single core is pegged at 100% for the duration of the index generation, so it appears to be CPU-limited... and I'm not sure that there is a way to have it take advantage of multiple cores on a single machine.

Thanks in advance!

Sandip

Andrew Ryder

May 26, 2014, 11:30:27 PM
to mongod...@googlegroups.com
Hi Sandip!

It sounds like your index build is limited by the speed of the sorting stage. I'm surprised that an index of that size would be CPU-bound, but that is what you have described. Index builds can't really be sped up: they have to scan every document and sort every key against every other key. On a dataset this large I would expect that step to be IO/disk-bound rather than CPU-bound. Can you double-check the symptoms? In particular, check your disk IO to make sure you haven't capped out there.
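
One quick way to check, from a second mongo shell while the build runs (the filter below just picks out the index build; the exact progress message varies by version):

    // Print any in-progress index builds; for long-running builds,
    // op.msg carries a progress counter.
    db.currentOp().inprog.forEach(function (op) {
        if (op.msg && /index/i.test(op.msg)) {
            print(op.opid + ": " + op.msg);
        }
    });

Alongside that, something like iostat -x 1 (or mongostat) on the host will show whether the disk is saturated while that one core is pegged.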

As an aside, 750GB of data stored at an average of 286 bytes per document is going to become inefficient very quickly as you add more indexes. If you only used _id to refer to these documents it would probably be fine, but further indexes are going to burn through memory very quickly. An index on an integer field across 3 billion documents is potentially over 10 gigabytes of integers alone, before even considering the overhead of B-tree buckets and document references. It sounds like you need to think about your schema a bit: can you merge data into larger documents? Do you really need this index?
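
Once a build finishes, the collection stats will tell you what each index actually costs on disk ("records" is a placeholder for your collection name):

    // Per-index sizes in bytes, and the total across all indexes:
    db.records.stats().indexSizes
    db.records.totalIndexSize()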

Kind regards,
Andrew

Sandip Chatterjee

May 27, 2014, 11:51:10 AM
to mongod...@googlegroups.com
Hi Andrew,

Thanks for the response and for the information.  Here are some updates since my original post, along with answers to your questions.

1) Although the CPU usage does seem to be pretty high during the whole index generation step, I think you're absolutely correct about it being disk IO-bound.  We tried putting the entire database on SSD storage (sans index) and generating the exact same index: this time ensureIndex took about 5 hours instead of 6 days!  (The only other change was upgrading from Mongo 2.4.3 to Mongo 2.6, and I don't see anything in the changelog that would drastically affect this.)

So... SSD storage seems to be the way to go for a database like this.

2) Yes, unfortunately we do need this index (actually this is the only index we really need).  The values for this integer field are not unique, so we can't just set the field to _id and allow it to be the only index.

We've explored alternative schemas a bit.  The easiest way to reduce the number of documents would be to group data by the same integer field that we're indexing on.  The problem there is that we would end up with giant documents (way larger than Mongo's 16MB limit), though we would probably also end up with an order of magnitude fewer documents.  We'll continue exploring this option...
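
One bucketing variant we may try, to keep grouped documents well under the limit: cap each group at a fixed number of entries and spill into a new document when a bucket fills up. A rough sketch (the collection name, the cap of 1000, and the field names are all illustrative):

    // Append a record to a non-full bucket for this key; the upsert
    // creates a fresh bucket when every existing one is at the cap.
    db.buckets.update(
        { key: 12345, count: { $lt: 1000 } },  // any bucket for this key with room left
        {
            $push: { records: { a: 1, b: 2 } },  // the original small document goes here
            $inc:  { count: 1 }
        },
        { upsert: true }
    )

A compound index on { key: 1, count: 1 } would then replace the single-field index and keep the "room left" check cheap.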

Sandip