Millions of collections per mongo database instance


Muhammad Irfan

Jan 22, 2015, 8:01:57 PM
to mongod...@googlegroups.com
I have a scenario where millions of users are expected to connect to a web application. The application collects different types of data (resulting in different JSON document structures) and persists each type to its own collection, with distinct collections defined per user for each type of data. So with, say, 5 types of data structures and a million users, that means 5 million collections.
I haven't actually run a test to see whether it is possible to have so many collections in one MongoDB database, but I know that the default limit of about 24,000 collections can be raised by increasing the size of the db.ns namespace file from its default of 16 MB.
The question is: even if it is possible to increase the number of collections, the mere fact that 10gen set the limit at 24,000 makes me believe that using such a huge number of collections will degrade performance somehow. (I know that more collections mean more disk usage for the same number of JSON documents, but disk space is not so important to me, unless it also means more RAM usage. Does it?)
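(For reference: on the MMAPv1 storage engine of that era, the namespace file size was controlled by the `nssize` server option, given in megabytes, and the 16 MB default allowed roughly 24,000 namespaces. A sketch of raising it at startup; the exact value here is only illustrative:)

```shell
# Start mongod with a larger namespace file; value is in MB (max 2047).
# MMAPv1 only, and it affects newly created databases, not existing ones.
mongod --nssize 128
```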

Other possibilities could be:
1. I use one collection per user (instead of 5), putting all types of data for a given user into the same collection. But even that only reduces the count to one million collections per database. How much better is that, given that I also lose read performance because the various types of documents are now merged into one collection?

2. A much more aggressive approach, turning the whole data model upside down, would be one collection per type, resulting in 5 collections in the database. That seems like a monumental loss in performance, since the data belonging to a million users now resides in one collection. (So it's just a raw possibility, IMHO not worth considering.)

In summary, what are the cons of using 5 million collections per database instance in MongoDB?
Should I try to find a middle ground, mixing the data of a number of users (let's say 100 users per collection)? In that case, would it be possible to somehow instruct the database to keep separate fragments within the collection for each user, to optimize reads for a given user?
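One way to realize such a middle ground is deterministic hash bucketing, so every reader and writer agrees on which collection holds a given user. A minimal sketch; the bucket count, the naming scheme, and the field names are assumptions for illustration, not MongoDB features:

```python
import hashlib

NUM_BUCKETS = 10_000  # ~100 users per bucket with a million users

def bucket_collection(user_id: str, data_type: str) -> str:
    """Deterministically map a user to one bucket collection per data type,
    e.g. 'heartRate_0042'. The naming scheme is an assumption, not a
    MongoDB feature; any stable hash works in place of md5."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return f"{data_type}_{int(digest, 16) % NUM_BUCKETS:04d}"

# The mapping is stable, so reads and writes always hit the same bucket:
assert bucket_collection("user-123", "heartRate") == \
       bucket_collection("user-123", "heartRate")
```

Note that within a bucket there is no MongoDB mechanism to physically fragment the collection per user; the closest equivalent is a compound index with the user id as the leading field, which keeps one user's entries adjacent in the index.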

Cheers

s.molinari

Jan 23, 2015, 5:13:56 AM
to mongod...@googlegroups.com
The data model should be based more on your data structures and your access patterns. You shouldn't implement multi-tenancy by spreading tenants over collections; I believe that would be the worst scenario.

If all users share the same data structures and the same access patterns within the same application, then your best bet is one database with one collection per data structure (I am saying this without knowing your access patterns). If the data needs to be accessed in a certain way, you can also think about normalizing or denormalizing it accordingly. To help you with these decisions, though, we'd need to know what the data structures look like, which data out of those structures needs to be accessed, in what ways, and how often.

Scott

Muhammad Irfan

Jan 23, 2015, 9:39:19 PM
to mongod...@googlegroups.com
"You shouldn't make multi-tenancy over collections. I believe that would be the worst scenario."
Knowing 'why' that would be the worst scenario is one of the main purposes of this post; I suspected there was something fishy about it. So can you please explain that in more detail?

So, if I understand correctly, having the data of a million users in one collection (all users sharing the same document structure for a given service) is normal, keeping in mind that a user would add only tens of documents per day.

Speaking of access patterns: the application mostly runs simple queries, accessing documents by userId, timeStamp, and a certain type field, all of which mostly lie on the first or second level of the document.
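For access patterns like these, a single compound index per collection would cover most reads. A sketch of the index specification in pymongo's list-of-(field, direction) form, together with a query it serves; the field names are taken from the description above, and the exact values are illustrative:

```python
# Direction constants as pymongo defines them (ASCENDING=1, DESCENDING=-1).
ASCENDING, DESCENDING = 1, -1

# Equality fields first (userId, type), then the range/sort field (timeStamp).
index_spec = [
    ("userId", ASCENDING),
    ("type", ASCENDING),
    ("timeStamp", DESCENDING),
]

# A query this index can answer without touching other users' documents:
query = {
    "userId": "user-123",
    "type": "heartRate",
    "timeStamp": {"$gte": 1421884800},
}

# The query's equality fields form a prefix of the index:
equality_prefix = [field for field, _ in index_spec[:2]]
assert all(field in query for field in equality_prefix)
```

With such an index, one user's documents sit adjacent in the index even though a million users share the collection, which is the usual answer to the per-user read-locality concern.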

I denormalize data most of the time to achieve as much atomicity as possible, unless normalization is absolutely required. One example is lists containing items: the items are put into their own collection with a reference to the corresponding list, because the lists are expected to grow without bound, which might exhaust the 16 MB document limit if the items were embedded directly in the list object.
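The list/items split described here can be sketched as two document shapes, where each item carries a reference back to its parent list so the parent stays bounded in size. All field names and values are illustrative assumptions:

```python
# Parent document: stays small and bounded, safely under the 16 MB limit.
shopping_list = {
    "_id": "list-1",
    "userId": "user-123",
    "name": "groceries",
    "itemCount": 2,  # optional denormalized counter kept on the parent
}

# Child documents: one per item, unbounded in number, each holding a
# reference to the parent list instead of being embedded in it.
items = [
    {"_id": "item-1", "listId": "list-1", "text": "milk"},
    {"_id": "item-2", "listId": "list-1", "text": "eggs"},
]

# All items of a list are fetched with one equality match on the reference
# (which should be indexed in the items collection):
assert all(item["listId"] == shopping_list["_id"] for item in items)
```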

s.molinari

Jan 24, 2015, 4:27:32 AM
to mongod...@googlegroups.com
The main reason for avoiding multi-tenancy over collections is that collections have a built-in limit. If you decide later in the life of the application that a new data structure and new collections are needed, and you are anywhere near the limit, you are out of luck adding that new feature/data. The only choice you'd have is to start up another database instance and expand onto that, which would make your application that much more complicated and open to problems. It is possible to do, though.

Another point to make is this: you gain nothing by spreading one structure's data over many collections. In fact, you are fighting the whole goal of sharding, since sharding happens at the collection level. In other words, by trying to make your system perform better over many "smaller" collections, you are sidestepping everything that was built to make MongoDB perform well. That isn't what MongoDB was built to do.

That is the best I can put it, given my admittedly limited knowledge.


Your suggestion is the second choice mentioned: shared database, separate schemas. But you are suggesting that the schemas will actually be the same, which makes taking this direction even less sensible; why separate, when the structure or schema is the same? You'll notice too that this direction is the least discussed in the post, simply because it is the least effective.

My project is also taking a multi-tenant approach, but because our tenants will have such diversity in their data, with opportunities to create their own data models, and because we still want to keep things as simple and efficient as possible, we are going with a database-per-customer approach. At first we were also planning to put "smaller" customers on a shared database (your suggested route) because of MongoDB's storage allocation scheme, but the new WiredTiger storage engine doesn't preallocate large data files, so we can offer this storage system even to small customers.

Obviously, the number of databases per server instance is also limited, but since MongoDB, as a database service, can be automated completely, we don't see this as too big an issue. Firing up and maintaining new database instances shouldn't be a limiting factor; if anything, finding a provider to handle the business would be. And we'd love to be so big that this becomes even the slightest of issues. :)

We are also going the database-per-customer/tenant route because we want our customers to be as autonomous as possible, should their usage of the system be very successful. We can then migrate them onto their own database cluster at some point, if need be, and offer them even better service.

At any rate, I hope that helps a little more.

Scott


Muhammad Irfan

Jan 24, 2015, 11:49:32 PM
to mongod...@googlegroups.com
Thank you for such a detailed explanation, Scott. I believe I'm looking at MongoDB from a whole different angle now!