Hi Andrew,
Thanks for the response and for the information. Here's some updates since my original post and answers to your questions.
1) Although the CPU usage does seem to be pretty high during the whole index generation step, I think you're absolutely correct about it being disk IO-bound. We tried putting the entire database on SSD storage (sans-index) and generating the same exact index -- this time ensureIndex took about 5 hours instead of 6 days! (the only other change was upgrading from Mongo 2.4.3 to Mongo 2.6, and I don't see anything in the changelog that would drastically affect that)
So... SSD storage seems to be the way to go for a database like this.
2) Yes, unfortunately we do need this index (actually this is the only index we really need). The values for this integer field are not unique, so we can't just set the field to _id and allow it to be the only index.
We've explored alternative schema a bit. The easiest way to reduce the number of documents would be to group data by this same integer field that we're indexing on. The problem there is that we would end up with giant documents (way larger than the 16MB Mongo limit). We would probably end up with an order of magnitude fewer documents though. We'll continue exploring this option...
Sandip