Large collections vs Large number of documents in a collection.

Archit Taneja

Mar 6, 2016, 2:02:55 PM

to mongodb-user

Hi,

I know this question has been asked many times, but most of the answers I have seen are use-case specific, so I am asking it again for my own use case.

So here it is.
I have a project with 1000 images. From each image, I get 1 million documents, which I currently store in a MongoDB collection.
I have 2 alternatives:

1. Create a separate collection for each image, across multiple projects (say 10 projects), so that each image (collection) would have 1 million documents that will be indexed.

2. Have a single collection and have 1000x1,000,000 documents in one collection, which again will be indexed.

My research and reading suggest that it is not a good idea to create 1000 collections (one per image), each with its own index, and that I should instead add a billion documents to a single collection, which MongoDB handles with no major issue.
My main aim is read performance. There will be no updates or deletes, only reads and writes. I can afford slow writes, but not slow reads. So am I right in choosing the 2nd alternative?

Any guidance is highly appreciated.

William Byrne III

Mar 31, 2016, 3:50:56 AM

to mongodb-user

Archit,

Splitting your billion document collection into a thousand collections each with a million documents won’t make queries (that use an index) noticeably faster, and it will make your application code much more complicated. I suggest you stay with the one billion document collection so long as you are confident that all queries will use an index.

While the total size of an index over a billion items is about 1000 times the size of the same index for only a million items, the resulting B-Tree does not have significantly more levels, so only a few more branch pages are traversed to reach the leaf nodes that point to the actual documents.
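To put rough numbers on that, here is a small sketch of the arithmetic. The fanout of 100 entries per page is purely an illustrative assumption (the real figure depends on key size and storage engine), and `btreeLevels` is a hypothetical helper, not a MongoDB function:

```javascript
// Count how many B-tree levels are needed to address `keyCount` keys when
// every page holds `fanout` entries. The fanout value is an assumption.
function btreeLevels(keyCount, fanout) {
  let levels = 0;
  let capacity = 1; // number of keys reachable with `levels` levels
  while (capacity < keyCount) {
    capacity *= fanout;
    levels += 1;
  }
  return levels;
}

const perImage = btreeLevels(1e6, 100); // one image's million documents
const combined = btreeLevels(1e9, 100); // the whole billion-document collection
console.log(perImage, combined); // prints "3 5"
```

Under that assumption a thousand times more keys costs only two extra levels per lookup.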

Let’s assume your users sometimes want to find:

some/all documents for a specific image
documents associated with different images with specific names, creators, creation dates etc.

You’d need to have some compound indices with leading fields similar to the following:

 {"imageId":1, "docID":1} -- unique key for the whole collection
 {"imageId":1, "docCreator":1, ...} 
 {"imageId":1, "createDate":1, ...} 
 {"docShortName":1, ...}   
 {"docCreator":1, ...} 
 {"createDate":1, ...}

With the single large collection model your application would use the one index that best fits the specific query criteria. With the separate collection per image model, queries where the imageId was known would use one index on one collection, but queries using the last three indices would have to be run against all 1000 collections and the results merged by your application. There would also be a significant additional administration load to keep all the indices on all 1000 collections in sync with each other.
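As a rough sketch of what that merging looks like in application code, compare the two functions below. The collection names (`image_<imageId>` and `documents`) and both function names are hypothetical; `db.getCollection()`, `find()`, `sort()` and `toArray()` are the usual shell calls:

```javascript
// Collection-per-image layout (hypothetical names "image_<imageId>"):
// a query that does not know the imageId must visit every collection.
function findAcrossImageCollections(db, imageIds, query) {
  let merged = [];
  for (const imageId of imageIds) {
    const docs = db.getCollection("image_" + imageId).find(query).toArray();
    merged = merged.concat(docs);
  }
  // The application, not the server, has to impose a global order.
  merged.sort((a, b) => a.createDate - b.createDate);
  return merged;
}

// Single-collection layout (hypothetical name "documents"):
// one query, which the server can answer from one index.
function findInSingleCollection(db, query) {
  return db.getCollection("documents").find(query).sort({ createDate: 1 }).toArray();
}
```

With 1000 images the first function issues 1000 queries and holds every match in application memory before sorting, while the second leaves both jobs to the server.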

If you have a query that can’t use an index, the whole collection will be scanned. Collection scans of a million documents are slow compared to index lookups and should be avoided, but collection scans on a billion documents are 1000 times slower again. Either way you will want all your queries to use an index, and for those indexes to fit into memory and stay there. You can use db.collectionName.stats().indexSizes to see all the indexes on a collection and their sizes in bytes.
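A minimal sketch of how you might check that, assuming you already know how much memory you can give to indexes. The helper names and the sample sizes are made up for illustration; only the shape of the `indexSizes` object (index name mapped to bytes) comes from `stats()`:

```javascript
// Sum the per-index byte counts reported by db.<collection>.stats().indexSizes.
function totalIndexBytes(indexSizes) {
  return Object.values(indexSizes).reduce((sum, bytes) => sum + bytes, 0);
}

// True when all indexes together fit within the given memory budget (bytes).
function indexesFitInMemory(indexSizes, memoryBudgetBytes) {
  return totalIndexBytes(indexSizes) <= memoryBudgetBytes;
}

// Made-up sizes for illustration only; in the shell you would pass
// db.collectionName.stats().indexSizes instead.
const sampleIndexSizes = { "_id_": 40e9, "imageId_1_docID_1": 55e9 };
console.log(totalIndexBytes(sampleIndexSizes));
console.log(indexesFitInMemory(sampleIndexSizes, 64 * 1024 ** 3));
```

The budget is something you have to choose yourself, since the working set of documents also competes for the same memory.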

If you really need to split up the billion document collection, I suggest you look at MongoDB’s sharding feature rather than creating multiple collections yourself. A sharded collection has multiple collections on multiple servers presented as a single logical collection to the end user. This gives you the benefits of multiple collections without the disadvantages I referred to above. See the Sharding section of the manual for more detail.
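For orientation, a sketch of the shell commands involved, wrapped in a hypothetical function. `sh.enableSharding()` and `sh.shardCollection()` are the standard shell helpers available when connected to a mongos; the database name, collection name and shard key below are assumptions, and choosing a good shard key for your data needs its own analysis:

```javascript
// `sh` is the sharding helper object the shell provides on a mongos
// connection. Database name, collection name and shard key are assumptions.
function shardDocumentsCollection(sh, dbName, collName) {
  const namespace = dbName + "." + collName;
  sh.enableSharding(dbName);
  // Assumed shard key: the compound unique key from the index list above.
  sh.shardCollection(namespace, { imageId: 1, docID: 1 });
  return namespace;
}

// Example call from the shell (names are hypothetical):
// shardDocumentsCollection(sh, "projects", "documents");
```

Your application keeps querying the one collection as before, and the mongos router decides which shards to ask.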