How best to design an efficient, scalable, multi-tenant data layer using MongoDB?


Aaron J Ban

Mar 2, 2014, 3:14:34 PM

I'm working on the architecture for my upcoming Project Management app (as an example), and I'm seeking clarity on how best to design the MongoDB data layer, specifically with regard to multi-tenancy. The app will have multiple 'sub-apps' (e.g. Calendar, Tasklist, Media, Team, etc.), each of which would map to a Collection in the database (either a centralized DB or its own Project DB).

DB Server == Replica Set.

The Questions

  1. Should I use one giant, centralized database to store all the application data, or create an individual database for each Project that is created on the system?
  2. If I choose the individual DB strategy, does that obviate the need to shard the data layer given that the DB's are 'naturally' dispersed across several servers, thus 'naturally' spreading the load across several servers? The application would contain logic that tells it which server to access the data for any given Project.
  3. Would using individual DB's for each Project give me better performance (given that to find any given document, Mongo would only have to search at most a few thousand docs in the individual Project DB vs. potentially millions in a giant, centralized DB)?
  4. Is it at all possible to reduce the 32M minimum footprint of a MongoDB database? I've read the documentation for the --smallfiles option, but it didn't really answer my question. Is 32M a hard minimum?
  5. If any given Project received a large amount of traffic, and became a 'noisy neighbour', would the solution just be to spin up a new DB Server and move that Project to the new server? or would it be a better approach to shard the DB Server that houses the noisy neighbour to increase performance on that server?
  6. What 'maintenance' concerns would I have with regard to cleaning up space for any given deleted Project, and/or 'shrink-wrapping' each DB to keep its footprint as close as possible to the actual amount of data stored in that Project database?
  7. What concerns should I be aware of with regard to future changes in the data 'schema' that would have to be 'rolled-out' across all the Project DB's? Given that Mongo is 'schema less', is it correct to assume that if I want to add a new 'field' to any given Collection that I would just do so in the app logic, without having to roll out any updates to the DB's themselves?
  8. What MongoDB 'tools' would I use to get information about the current 'status' of any given DB Server? (I've listed what I've found so far in the sketch after this list.)
  9. Are there any limits to the number of DB's that can be housed on any given DB Server that I should be aware of?
  10. How does the individual DB strategy impact back-ups? Are there any concerns I should be aware of when backing up (to S3 for disaster recovery) many DB's across many DB Servers?
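
For context on question 8, here's what I've found in the docs so far. These are standard shell helpers and bundled CLI tools as far as I can tell, but correct me if I've misread the manual:

```javascript
// From the mongo shell:
db.serverStatus()        // connections, memory, opcounters, lock stats
db.stats()               // storage and index sizes for the current DB
db.getCollectionNames()  // what a given Project DB contains
rs.status()              // health of each replica set member
// From the command line: mongostat (live op counters) and mongotop
// (per-collection read/write time) ship alongside mongod.
```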

The App Stack

  • Ubuntu 12.04 LTS
  • Nginx
  • node.js
  • express.js
  • MongoDB

Current Working Strategy

My current working strategy is to use one database to store the higher-level, 'global' data like Users, Notifications, Messages, Usage, and Preferences, and then to create a new database for each Project created on the system.

This seems like the ideal approach for many reasons: security (each DB has its own creds), catastrophic recovery (if one DB Server goes down, the entire app doesn't go down with it), and performance (I think, since Mongo would have to search far fewer docs to find the one it's looking for).

The application would contain logic that automatically detects available space on any given DB Server and creates the new Project database on the next available DB Server.
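
To make that concrete, here's a rough Node.js sketch of what I mean. The server list, the 'project_<id>' naming convention, and the least-loaded selection are my own assumptions, not anything MongoDB provides out of the box:

```javascript
var MongoClient = require('mongodb').MongoClient;

// Assumed inventory of DB Servers (replica sets); entirely my own convention.
var servers = [
  'mongodb://db1.example.com:27017',
  'mongodb://db2.example.com:27017'
];

// listDatabases reports totalSize: bytes on disk across all DBs on that server.
function totalSize(uri, cb) {
  MongoClient.connect(uri + '/admin', function (err, db) {
    if (err) return cb(err);
    db.command({ listDatabases: 1 }, function (err, res) {
      db.close();
      cb(err, res && res.totalSize);
    });
  });
}

// Naive placement: pick the server currently holding the least data.
function pickLeastLoaded(cb) {
  var best = null, bestSize = Infinity, pending = servers.length;
  servers.forEach(function (uri) {
    totalSize(uri, function (err, size) {
      if (!err && size < bestSize) { best = uri; bestSize = size; }
      if (--pending === 0) cb(best ? null : new Error('no server reachable'), best);
    });
  });
}

// Create the per-project DB; the first insert materializes it on disk.
function createProjectDb(projectId, cb) {
  pickLeastLoaded(function (err, uri) {
    if (err) return cb(err);
    MongoClient.connect(uri + '/project_' + projectId, function (err, db) {
      if (err) return cb(err);
      db.collection('meta').insert({ createdAt: new Date() }, function (err) {
        db.close();
        cb(err, uri); // the caller records the placement in the global DB
      });
    });
  });
}
```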

According to this article provided by MongoHQ, this is the 'best' strategy, although it consumes a large amount of storage, especially since each DB takes up 32M even when it's empty. That gets very expensive on a service like MongoHQ if you're offering a 'Freemium' app that gets Techcrunch'd.

So in a scenario where ProjectManager has three projects on the system my data layer would look like so:

ProjectManager
  • Users
  • Notifications
  • Messages
  • Usage
  • Preferences
  • Projects

Project01
  • Calendar
  • Tasks
  • Media
  • Team

Project02
  • Calendar
  • Tasks
  • Media
  • Team

Project03
  • Calendar
  • Tasks
  • Media
  • Team

Each of the above ProjectXX DB's would be tiny, storing about 2,000-3,000 documents at most.

Thanks in advance for taking the time to provide any insight.

Anand George

Mar 3, 2014, 12:40:09 AM

I would approach it a little differently considering MongoDB's strengths:

1. One sharded cluster of replica sets (minimum 2 data-bearing members + 1 arbiter per set).
2. Use project id as the shard key (see the shell sketch after this list). Assign project id's so each shard gets a mix of project types (low traffic and high traffic), so the load on the shards balances out.
3. Schema would be handled by application logic, so you could have different business logic while keeping the database common.
4. A common database also enables aggregation and other tasks required for reports etc. that need to merge data from multiple projects.
5. Use a tool like MMS for monitoring and backup.
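
In the shell that would look something like this; 'pm' and the collection names are placeholders for whatever the app actually uses:

```javascript
sh.enableSharding("pm")
// Every tenant-scoped document carries a projectId field:
sh.shardCollection("pm.tasks",    { projectId: 1 })
sh.shardCollection("pm.calendar", { projectId: 1 })
// A compound key such as { projectId: 1, _id: 1 } would keep a single very
// large project splittable across chunks rather than pinned to one shard.
```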





s.molinari

Mar 3, 2014, 2:52:53 AM

Hi,

First, a disclaimer: I'm learning Mongo, just like you.

Here is an interesting post similar to yours. https://groups.google.com/forum/#!topic/mongodb-user/lc-_8gR8OVU

In general, something I've learned is that Mongo must be used in the way that best fits how you (or your customers and their users) will access the stored data. Since you are going to have a lot of smaller databases, file allocation size will be a factor. A Mongo database takes up around 200MB by default (1st file 64MB, 2nd file 128MB). The 32MB minimum applies only when settings that are quite important for performance, like preallocation, the standard data-file sizes, and the journal (which provides atomically safe write operations), are all turned down or off, along with a minimized oplog. That is not a good idea for a well-running Mongo system and should be avoided.
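
Just to illustrate what "turned down or off" means, this is roughly the mongod.conf (2.4-era syntax) you'd need to get near that 32MB floor; I list it only to show what you'd be giving up:

```
smallfiles = true   # data files start at 16MB, capped at 512MB (vs. 64MB/2GB)
noprealloc = true   # don't preallocate the next data file
oplogSize = 50      # in MB; shrinks the replication oplog
nojournal = true    # drops the journal, and with it crash-safe writes
```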

I am not sure what you mean by "given that the DB's are 'naturally' dispersed across several servers". Mongo databases are not naturally dispersed, unless you yourself put them on different machines, and then I wouldn't consider that natural. :) I can't imagine running DB's on multiple machines for the sake of dispersal being cost effective either. From what I've learned, a minimum production set-up is a 3-server replica set with some fairly decent hardware (especially a decent amount of RAM). When access to the data exceeds the replica set's capabilities, or when the size of the data grows beyond a single server's capacity, that is when you could consider scaling. You can scale vertically, by getting bigger servers, or horizontally with sharding. Sharding is done at the collection level, so it doesn't come into play in the question of one DB per customer or not.

What does come into play is database maintenance, which you mention in point 6. If you mix customers over a small number of databases, it becomes more of a chore and a bigger opportunity for mistakes if you have a lot of customer churn. If you're not sure your customers will stay for a while, you might want to consider this aspect. Compacting is possible, but it also means downtime, unless you are on a replica set, which can mitigate the compacting downtime. The commands I mean are sketched below.
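
For reference, these are run against the database that held the deleted Project's data (the collection name is just an example):

```javascript
db.runCommand({ compact: "tasks" })  // defragments one collection; blocks that DB while it runs
db.repairDatabase()                  // rewrites all the DB's files to reclaim space;
                                     // needs free disk roughly equal to the data size
```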

Your point 5 is also something I've been considering, and there is no easy answer from what I've learned. I am not sure about your specs, but we are aiming at as little downtime as possible, and any "movement" of data = downtime. We haven't reached a conclusion on the best practice for our "successful customers". I like to think of customers needing more database power (i.e. being noisy) as a good thing. ;)

Mongo is a RAM-intensive application: in order to be fast, it keeps a lot of the most-used or "hot" data in RAM (also known as the working or active set). This is a great aspect of Mongo, as it basically removes the need for a caching system like memcached, but it also makes operation costly (the same is true for any DB plus a caching system): you can't just see 1TB of space available on disk and expect to run 1TB of databases on that machine with, say, only 2GB of RAM. That isn't going to fly. I have yet to find any spec that says X GB of used disk space = Y GB of RAM, and I don't think I ever will, simply because it is very hard to tell who needs what data when, especially for our type of system. So this is going to be a very tricky part of our Mongo usage.

For point 7: MongoDB is not really "schemaless". You still have to think about schema and data modeling; let's call it a semi-flexible schema. I say semi-flexible because yes, you can tack fields onto any document at will (and this is a beauty of Mongo), but you must also be aware that tacking on fields that push a document beyond its padded size on update is a no-no in Mongo. It means the document has to be physically moved to another part of the disk, causing a real knock in performance for updates of older records. One common workaround is sketched below.
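
The workaround I've seen is to pre-pad documents at insert time; the field names here are just for illustration:

```javascript
// Insert with a throwaway filler so the record is allocated large, then drop it;
// the record keeps its size, leaving headroom for in-place growth.
var filler = new Array(1024).join("x");                   // ~1KB of padding
db.tasks.insert({ _id: 1, title: "Demo", pad: filler });
db.tasks.update({ _id: 1 }, { $unset: { pad: "" } });
// Growth that fits in the freed space now updates in place instead of moving:
db.tasks.update({ _id: 1 }, { $set: { notes: "added later without a move" } });
```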

One other thing to remember is that Mongo's write lock is at the database level. If your application tends to be write-heavy, and will be for all of your customers and their users, this could come into play and might push you towards the one-DB-per-customer scenario. If you are making a "normal" web application, it probably isn't write-heavy, so you are OK here.

We are actually thinking about going with a sort of hybrid system, though our requirements are a bit different from yours: our customers won't have similar data schemas, or rather, we won't know what our customers' schemas will look like. We are going to have two systems, which is why I call it a hybrid. The first, "starter" system will spread customers over a single database, basically to combat the 200MB default database size. We want to give each customer 50MB of data storage to start, which means 4 customers per DB. This starter system will not be a full replica set, but rather 2 replicas and an arbiter (to save costs), and we will make it clear that data backups are daily only!

As a customer grows and gets close to 200MB (which actually means they are already taking up 0.5GB!!), we will move them to their own database on a full-blown replica set. This will be a planned move, and we will make it clear it will incur downtime. If the customer gets even bigger, she could be moved to her own database cluster. The possible downtime needed for that move, however, is something we want to avoid (obviously this customer is successful, and we don't want to get in the way of that success!!!), and we are still looking at ways to get such a move to new servers done as painlessly as possible.

I hope I could help, and I'm eager to hear the counters I'll get from the pros. ;) Again, take my comments with a grain of salt; I am learning too.

Scott
      

Aaron J Ban

Mar 3, 2014, 4:33:20 PM

Wow. Thanks so much, Anand and Scott. Such amazing answers; I truly appreciate you taking the time. You've both pretty much confirmed my new working strategy:
  1. Use a centralized DB on a replica set
  2. Shard when traffic proves it's necessary
  3. Move noisy neighbours to their own DB if absolutely necessary

Using this strategy, I'll need to pay careful attention to the following:
  • ACL's for the co-mingled data (and static files); see the sketch after this list
  • Establishing a shard key now, and never, ever changing it
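
For the ACL point, what I'm picturing in the app layer looks roughly like this (Express); canAccess() and the collection name are hypothetical:

```javascript
// Every query against a co-mingled collection is scoped by projectId,
// and the route checks the caller's rights to that project first.
app.get('/projects/:id/tasks', function (req, res) {
  if (!req.user.canAccess(req.params.id)) return res.send(403); // hypothetical ACL helper
  db.collection('tasks')
    .find({ projectId: req.params.id })  // never an unscoped find()
    .toArray(function (err, tasks) {
      if (err) return res.send(500);
      res.json(tasks);
    });
});
```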

With this strategy it's financially feasible to use a service like MongoHQ (or MongoLab, or ObjectRocket) rather than spinning up my own servers (phew), and to offer a freemium pricing model if I decide to do so (or at the very least a free trial).

Thanks again guys, really appreciate it. I'm now going to focus on data security and I'll post back with anything relevant I find.

AJB