Multiple Databases vs. Multiple Collections vs. Single Collection

1,114 views
Skip to first unread message

Tucker

unread,
Oct 26, 2011, 9:14:10 PM10/26/11
to mongodb-user
I'm moving from an RDBMS to MongoDB and would like some advise on how
to represent my data. My application represents data in Projects and
Uploads. An upload contains on average 300,000 documents (currently
rows) and a project consists of multiple uploads (upwards of 100). My
frontend application needs to present data per upload and per project.
Would it make sense to create a separate MongoDB database for each
project and each upload would be a collection in the database? Or
would it be more Mongo-like to have to single database with a single
collection and create an index on the project_id and upload_id
attributes in a document for efficient querying? I would like to take
advantage of MongoDB's autosharding ability as the data will grow
beyond the size of a single node. I could set the shard key to the
upload_id so uploads could be efficiently sharded. Any suggestions or
advise would be greatly appreciated.

Karl Seguin

unread,
Oct 26, 2011, 10:16:22 PM10/26/11
to mongod...@googlegroups.com
Probably:
1 database,
2 collections: uploads and projects

Karl

Tucker

unread,
Oct 27, 2011, 11:40:14 AM10/27/11
to mongodb-user
Thanks for the quick response. A further question I have is about
indexing in this model. Virtually all of my queries will be fetching
documents scoped to a project and an upload such as
'db.documents.find({ project_id: 1, upload_id: 20, name:
"some_name" })'. For this reason I thought having individual
collections per upload would help with query performance. If I were to
have one collection per upload, the query would then look like this,
'project_1_db.upload_20.find({ name: "some_name" })'. However, would
it be just as beneficial to stick to a single collection and use
compound keys for my indexes, such that an index on name would be a
compound on project_id, upload_id, and name? I'm in the process of
prototyping to get some performance statistics but I wanted to get an
understanding of MongoDB 'best practices'. Thanks for the help.

Karl Seguin

unread,
Oct 27, 2011, 11:56:59 AM10/27/11
to mongod...@googlegroups.com
You are being a little premature on that optimization. Not that I don't like the idea, but the problem that you'll run into is that you won't be able to shard it (which you mentioned wanting to do in your first message). There's a feature request up for sharding on collections (https://jira.mongodb.org/browse/SERVER-939), which you should upvote if you are interested in it...but for the time being, you can only shard data within a collection.

I guess you need both the project_id and upload_id because the upload_id isn't unique (it starts back at 1 for each project). It doesn't really change anything, but I guess I have envisioned that each upload would have a unique id, so that it'd be:

db.uploads.find({project_id: ObjectId("bkaljdlasjd")})
db.documents.find({upload_id: ObjectId("blahblahblah"), name: 'abc'})

So that your indexes would be on project_id for uploads, and upload_id and name for documents. You can denormalize the project_id and stick it in documents, but again, I wouldnt' do that until it proved necessary.  Again, this last part doesn't really change anything, just thinking out loud.

TL;DR - your solution, which I've thought about using myself before, doesn't work with the autosharding.

Tucker

unread,
Oct 28, 2011, 10:00:25 AM10/28/11
to mongodb-user
Thanks for the advise. I'll stick with a single collection for
Document and Upload and embed the Project in the Document as you
suggested. A Project will only contain a few fields such as user_id,
name, and timestamp. I'm also going to stick with the default _id key
for the shard key.
Reply all
Reply to author
Forward
0 new messages