Archit,
Splitting your billion document collection into a thousand collections each with a million documents won’t make queries (that use an index) noticeably faster, and it will make your application code much more complicated. I suggest you stay with the one billion document collection so long as you are confident that all queries will use an index.
While the total size of an index over a billion items is about 1000 times the size of the same index for only a million items, the resulting B-Tree does not have significantly more levels, so only a few more branch pages are traversed to reach the leaf nodes that point to the actual documents.
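To put rough numbers on that, here is a minimal sketch that counts how many levels a tree needs to address a given number of keys. The fan-out of 100 keys per page is an assumption chosen purely for illustration (real fan-out depends on key size and the storage engine), and btreeLevels is a hypothetical helper, not a MongoDB function.

```javascript
// Rough estimate of B-Tree depth: the smallest number of levels such that
// fanout^levels >= keyCount. The fan-out of 100 is an assumed, illustrative
// value, not a measured MongoDB figure.
function btreeLevels(keyCount, fanout) {
  let levels = 0;
  let capacity = 1;
  // Repeated multiplication avoids floating-point error from Math.log.
  while (capacity < keyCount) {
    capacity *= fanout;
    levels += 1;
  }
  return levels;
}

const ASSUMED_FANOUT = 100;
console.log(btreeLevels(1e6, ASSUMED_FANOUT)); // 3
console.log(btreeLevels(1e9, ASSUMED_FANOUT)); // 5
```

Under that assumption, a thousand-fold increase in documents adds only two levels to the tree.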
Let’s assume your users sometimes want to find:
You’d need to have some compound indexes with lead columns similar to the following:
{"imageId":1, "docID":1} -- unique key for the whole collection
{"imageId":1, "docCreator":1, ...}
{"imageId":1, "createDate":1, ...}
{"docShortName":1, ...}
{"docCreator":1, ...}
{"createDate":1, ...}
With the single large collection model, your application would use the one index that best fits the specific query criteria. With the separate collection per image model, queries where the imageId was known would use one index on one collection, but queries using the last three indexes would have to be run against all 1000 collections and the results merged by your application. There would also be a significant additional administration load to keep all the indexes on all 1000 collections in sync with each other.
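As a sketch of what that merging means for application code, the example below uses plain in-memory arrays as a stand-in for per-image collections. The Map, the sample documents, and the findByCreator helper are all made up for illustration; a real application would issue one query per collection instead of looping over arrays.

```javascript
// Hypothetical in-memory stand-in for the "one collection per image" model.
// Each key is an imageId; each value plays the role of that image's collection.
// All sample values are invented for illustration.
const collectionsByImage = new Map([
  ["img1", [{ docID: 1, docCreator: "ann", createDate: "2013-01-05" },
            { docID: 2, docCreator: "bob", createDate: "2013-01-02" }]],
  ["img2", [{ docID: 1, docCreator: "ann", createDate: "2013-01-03" }]],
  ["img3", [{ docID: 1, docCreator: "cat", createDate: "2013-01-04" }]],
]);

// A query that does not specify imageId has to visit every collection,
// then merge and re-sort the partial results in application code.
function findByCreator(collections, creator) {
  const merged = [];
  for (const [imageId, docs] of collections) {
    for (const doc of docs) {
      if (doc.docCreator === creator) {
        merged.push({ imageId, ...doc });
      }
    }
  }
  // ISO date strings sort chronologically under plain string comparison.
  merged.sort((a, b) => a.createDate.localeCompare(b.createDate));
  return merged;
}

const annDocs = findByCreator(collectionsByImage, "ann");
console.log(annDocs.map((d) => `${d.imageId}/${d.docID}`)); // [ 'img2/1', 'img1/1' ]
```

With a single collection, the same lookup is one query against one index, and the server does the sorting.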
If you have a query that can’t use an index, the whole collection will be scanned. Collection scans of a million documents are slow compared to index lookups and should be avoided, but collection scans on a billion documents are 1000 times slower again. Either way you will want all your queries to use an index, and for those indexes to fit into memory and stay there. You can use db.collectionName.stats().indexSizes to see all the indexes on a collection and their sizes in bytes.
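A small helper can total those sizes so they are easier to compare with available RAM. This is only a sketch: summarizeIndexSizes is a hypothetical function, and the sample object below is made up to mimic the shape of the indexSizes field (index name mapped to size in bytes).

```javascript
// Summarize the per-index sizes reported by db.collectionName.stats().
// `indexSizes` maps index name -> size in bytes.
function summarizeIndexSizes(stats) {
  const entries = Object.entries(stats.indexSizes);
  const totalBytes = entries.reduce((sum, [, bytes]) => sum + bytes, 0);
  return {
    count: entries.length,
    totalBytes,
    totalMiB: totalBytes / (1024 * 1024),
  };
}

// Made-up sample values; in the shell you would instead call
// summarizeIndexSizes(db.collectionName.stats()).
const sampleStats = {
  indexSizes: { "_id_": 4194304, "imageId_1_docID_1": 6291456 },
};
const summary = summarizeIndexSizes(sampleStats);
console.log(summary.count, summary.totalMiB); // 2 10
```

The stats() output also includes a totalIndexSize field if you only need the overall figure.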
If you really need to split up the billion document collection, I suggest you look at MongoDB’s sharding feature rather than creating multiple collections yourself. A sharded collection is spread across multiple servers but presented as a single logical collection to the end user. This gives you the benefits of multiple collections without the disadvantages I referred to above. See the Sharding section of the manual for more detail.
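For orientation, sharding a collection from the shell looks roughly like the sketch below. The database name "mydb", the collection name "documents", and the choice of {imageId, docID} as the shard key are assumptions for illustration only; picking a shard key deserves careful thought for your own workload. The sh helper is passed in as a parameter, and a recording stand-in is used here so the snippet can run without a cluster.

```javascript
// Sketch of enabling sharding with the shell's `sh` helper.
// "mydb", "documents", and the shard key are assumed names for illustration.
function shardDocumentCollection(sh) {
  sh.enableSharding("mydb");
  return sh.shardCollection("mydb.documents", { imageId: 1, docID: 1 });
}

// Stand-in that records calls instead of talking to a cluster.
// In a real shell session you would pass the built-in `sh` object.
const calls = [];
const fakeSh = {
  enableSharding: (dbName) => calls.push(["enableSharding", dbName]),
  shardCollection: (ns, key) => calls.push(["shardCollection", ns, key]),
};

shardDocumentCollection(fakeSh);
console.log(calls.length); // 2
```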