Collection vs Database


Brian Carpio

Sep 30, 2011, 9:10:32 PM
to mongodb-user
From an operations perspective, what is the overhead of having 100s of
databases vs. 100s of collections, especially in a subscription-based
model like SalesForce.com or even Jira?

While I understand this is a VERY general question, any blog posts or
shared experiences would be very beneficial.

Off the top of my head I can see that there are some benefits to
having 100s of databases:

- You can drop a database that represents a customer without
impacting other databases or having to run a compaction on the
database.
- Statistics per database. I guess this can really be a pro and a con:
statistics per customer/database would give you a better perspective
into that customer's use of the application, but it can also create a
management nightmare for the operations team if they wish to monitor
databases at this level.
- What impact do multiple databases have on sharding vs. collections?

I know companies such as craigslist.com have turned to mongodb for
archiving their posts (not sure if people can release this info),
but how does craigslist.com decide to split databases vs. collections?
Is there a database per city? Per category? One big huge database with
collections split by category or city (since I assume all categories
are the same across cities)?

Any info that can be presented would be helpful.

Karl Seguin

Sep 30, 2011, 9:47:16 PM
to mongod...@googlegroups.com
I think a big factor, which would drive a lot of people's decisions, is that autosharding happens on a collection key. So if I were craigslist and my archive were going to be sharded (which I kind of assume it is, but I have no idea), I wouldn't split my cities per database, and I wouldn't split my cities per collection. I'd have all similar entities (like posts and users) each in their own collection (like you'd have with a database) and I'd set up a shard key on whatever made sense (like the city for posts).
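For what it's worth, the setup described above would look roughly like this from the mongo shell (the database and collection names here are made up; this is a sketch, not a tested command sequence):

```javascript
// enable sharding for the database, then shard the posts
// collection on its city field
sh.enableSharding("archive")
sh.shardCollection("archive.posts", { city: 1 })
```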

Multiple databases seem like a pain to me... if you were building a forum, you wouldn't create 1 database per thread. Multiple collections is less crazy (again, to me), but mongo can't shard that (yet), so sharding would have to be handled manually by your code.

From a reporting/query/statistics/whatever perspective... there's nothing you can do with a per-database/collection setup that you can't do with a more normal approach:

//per collection
db.posts_newyork.mapReduce(...)

//normal single collection, scoping with the query option
db.posts.mapReduce(..., {query: {city: 'newyork'}})

But, there are things that wouldn't be easy to do the other way. If you want to generate aggregate statistics for multiple clients, then having the data artificially split across collections or databases becomes a problem.
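To make that aggregation point concrete, here is a plain-JavaScript toy (ordinary Node code, not the mongo shell; the data and helper names are invented for illustration). With one collection, a single grouping pass sees everything at once; with per-city collections, each partial result has to be merged by hand:

```javascript
// Toy documents standing in for a posts collection.
const posts = [
  { city: 'newyork', user: 'bob' },
  { city: 'denver', user: 'ann' },
  { city: 'newyork', user: 'eve' },
];

// Single collection: one grouping pass over all documents.
function countByCity(docs) {
  const counts = {};
  for (const doc of docs) {
    counts[doc.city] = (counts[doc.city] || 0) + 1;
  }
  return counts;
}

// Per-city collections: each is counted separately, and the
// partial results must then be merged manually.
function mergeCounts(partials) {
  const merged = {};
  for (const partial of partials) {
    for (const [city, n] of Object.entries(partial)) {
      merged[city] = (merged[city] || 0) + n;
    }
  }
  return merged;
}

console.log(countByCity(posts)); // { newyork: 2, denver: 1 }
```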


I will say that breaking things up per database/collection can save you some memory/space, as you'll normally be able to avoid the index on the field by which you are partitioning your data... although, again, at the cost of auto-sharding.

Karl

Brian Carpio

Sep 30, 2011, 10:12:37 PM
to mongodb-user
Thanks Karl.

So if sharding happens on a collection key and each database shares
the same collection key (let's say in craigslist's case we shard on
user_name, and in Denver we have users A-Z just like we do in Seattle,
since obviously people named Bob live in each city), can sharding
happen across databases if the collection key is the same across
databases?

Or can sharding happen only per database, so Denver would potentially
shard differently than Seattle? But wouldn't auto-sharding still
happen? You might have more people named Bob in Denver and fewer in
Seattle, so in a two-shard situation of A-M and N-Z, shard_A would
have more people named Bob, and Denver data might be higher for A-M
while there would be less Seattle data because names like Mike are
more common in Seattle (this is just a crazy example to get the point
across).
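A toy illustration of the two-chunk example above (the ranges, names, and shard labels are all made up). With range-based sharding, a document's chunk depends only on its shard-key value, not on which city or database it came from:

```javascript
// Route a document to a chunk by the first letter of its
// user_name shard key: A-M on shard_A, N-Z on shard_B.
function routeToShard(userName) {
  const first = userName[0].toUpperCase();
  return first <= 'M' ? 'shard_A' : 'shard_B';
}

console.log(routeToShard('bob'));  // shard_A
console.log(routeToShard('mike')); // shard_A ('M' falls inside A-M)
```

So every Bob, whether from Denver or Seattle, lands on shard_A, which is exactly why an uneven distribution of names would skew chunk sizes in the way described.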

Or would mongo choose not to shard the data because each database is
so small? Denver and Seattle each only take up a few GB of disk
space, so there is no reason to shard?

Thanks,
Brian

Karl Seguin

Sep 30, 2011, 10:37:24 PM
to mongod...@googlegroups.com
You are right: you could split your data across multiple databases and then still take advantage of sharding on a sub-key. I still think that, in most cases, this is the wrong approach to take (at least up front... maybe over time you'll be driven to such an approach, but I doubt it).

As far as I know, if you tell Mongo to shard, it'll do so regardless of size. However, there's overhead with sharding (more so with some operations than others), so sharding small collections is generally a bad idea.

Steve Francia

Oct 3, 2011, 1:36:47 PM
to mongodb-user
A few more things to consider.

- Creating a new database is more expensive than creating a new collection.
- A database has a minimum file size of around 200MB.
- Security is handled at the db level, so you would need to either
create an admin user to access all dbs, or one user per database.
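A sketch of that security point (the database name, user name, and password are placeholders; db.addUser was the shell helper of that era):

```javascript
// each customer database gets its own user
var customerDb = db.getSiblingDB("customer_a");
customerDb.addUser("customer_a_user", "some_password");
```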

In general you will have more administration overhead with the
multiple DBs approach.

Performance between the two would be comparable overall.

There are some performance issues once you have 100s of 1000s of
databases on a single node (mostly due to how Linux handles the
files), though this doesn't sound like a concern in your case.

It does seem a bit backwards to shard databases in the way you are
describing. Effectively you are already sharding the data manually
into separate databases. Why would you then want to collectively shard
them again?

Brian Carpio

Oct 3, 2011, 11:16:51 PM
to mongodb-user
Actually, I think each database takes up 80MB of disk space (unless
this changed in 2.x?):

-rw------- 1 root admin 64M Sep 28 15:29 first_database.0
-rw------- 1 root admin 16M Sep 28 15:29 first_database.ns

64MB is the first data file, and the .ns file is the namespace file
created for the database...

Karl Seguin

Oct 3, 2011, 11:25:39 PM
to mongod...@googlegroups.com
Even though they don't recommend doing it in production, what if you run mongod with --noprealloc?

(I'm finding myself too lazy to test it out at this exact moment...)

Karl