Large number of small documents vs smaller number of large documents


Aleksei T

Feb 19, 2011, 6:58:27 PM
to mongod...@googlegroups.com
This group is so helpful I can't stop asking things :)) 

We are considering a design where each document stores an element representing a flexible, unique combination of several values (in the spirit of the GROUP BY clause in SQL) along with a flexible number of counters for that combination.

We are trying to decide which approach is better, both for performance and for the way the data would be queried:

Approach 1 - flatten out the structure and have a large number of really small documents representing all the unique combinations for a particular time period (e.g. day), example:

{ id:1, time:1298159028, groups:["x","y","z"], aggregates: { bytes:3245, requests:345 } }
{ id:1, time:1298159028, groups:["a","b","c"], aggregates: { bytes:4532, requests:124 } }
...more documents like that...

Approach 2 - combine all the unique combinations as sub-elements in a single document for the time period, example:
{ id:1, time:1298159028,
        data: {
              "['x','y','z']" :  { bytes:3245, requests:345 },
              "['a','b','c']" :  { bytes:4532, requests:124 },
              ...more elements like that...
        }
}

So, basically it's a choice between a larger number of smaller docs (flattened out) vs a smaller number of larger docs for the same time period.

On the query side, the data will likely be queried by the id and a datetime range, not by individual unique group items.

So, my initial gut feeling is that it's better to go with #2, as the query would return a single document per time period.  The downside is the arbitrary size of the resulting document and the risk of hitting the upper document size limit (16MB), but that is probably not a huge concern.
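
For reference, the query shape would be roughly the same either way (just a sketch, with a made-up collection name and time range):

db.stats.ensureIndex({ id: 1, time: 1 })
// the same range query works for either approach; with #1 it returns many
// small documents, with #2 it returns one (larger) document per time period
db.stats.find({ id: 1, time: { $gte: 1298073600, $lt: 1298160000 } })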

Would be interesting to hear pros and cons, as I am sure a lot of people have made trade-off decisions like that and could share some of their experience :)

Thanks everyone, this group is great :)


Scott Hernandez

Feb 19, 2011, 7:01:35 PM
to mongod...@googlegroups.com
If there is a chance it could hit the upper limit, you are better off going flat.


Also, there is an extra cost associated with constantly growing documents.


Aleksei T

Feb 19, 2011, 7:14:35 PM
to mongod...@googlegroups.com
Thanks, good point; that's my main concern about growing docs, but I am not overly worried about hitting the upper limit, as we will guard against this data "explosion" on the web application side when configuring the data collection group-bys.

But I am really interested in hearing about performance-related trade-offs like the extra cost of growing documents that you mentioned.  Isn't there also a higher overhead in returning a large number of smaller documents vs one larger document?  Is there a clear benefit of one approach over the other based on the internal workings of MongoDB?  I understand it is meant to store a huge number of documents, but it is also good at storing somewhat deeper documents.  I guess the answer is to try both on a control data set, all other things being equal, see the difference, and go from there.

I was just wondering if anyone went one way or the other and then changed it around or confirmed it was indeed the right way to go the first time :)  Want to learn from other people's mistakes huh :)

Keith Branton

Feb 19, 2011, 7:43:38 PM
to mongod...@googlegroups.com, Aleksei T
If you're absolutely sure the 16MB limit won't bite you later, then you can keep considering the larger-document approach. The possibility of ever exceeding the limit would normally be enough to force your hand and change the design of your database. Still, 16MB is pretty big.

Scott's other concern is very valid *IF* you grow these documents after the first insertion, but it's not clear from your question whether you do or not. If you compute the entire document, insert it, and don't update it (or at least don't make it bigger with updates), then you won't have a problem here. If you are planning to incrementally add data to existing documents, then you may end up with a very large padding factor, a lot of wasted space in your database files, and time spent moving documents around as they grow.
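
To illustrate the difference (sketch only, with a made-up collection name):

// written once at roughly its final size - no later growth, no document moves
db.stats.insert({ id: 1, time: 1298159028, data: { /* all group rows computed up front */ } })

// grown after insertion - every new group key can push the document past its
// allocated space, forcing a move and driving the padding factor up
db.stats.update({ id: 1, time: 1298159028 },
                { $set: { "data.['a','b','c']": { bytes: 4532, requests: 124 } } })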

Lots of small documents will result in a much larger index. Depending on how you access the data, this can significantly increase the working set size, which can cause problems if you don't have enough RAM. They also mean you are repeating the id and time fields on every record. That's a lot of extra space - and while space is cheap, accessing it quickly is not. With databases, a compact design is better, because sooner or later the database has to go to disk, and that is slow.

So you see there are trade-offs at play here.  I recommend prototyping this with your production data and hardware to see how each approach performs. Hopefully, depending on your non-functional requirements, a clear winner will emerge. I'm not sure anyone could predict this for your data and use cases.

Aleksei T

Feb 19, 2011, 8:18:45 PM
to mongod...@googlegroups.com, Aleksei T
Great, thanks for the insight, very helpful.  The insert behavior will be such that the document will definitely grow over time (probably faster initially, then leveling off as it saturates the volume of the unique data set), with a large number of $inc upserts performed continuously against the unique group-by elements to populate the values and increment the counters.
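
Something like this, roughly, for approach #2 (just a sketch; the collection name and numbers are made up):

// $inc upsert against one group-by element inside the period document
db.stats.update(
    { id: 1, time: 1298159028 },
    { $inc: { "data.['x','y','z'].bytes": 3245,
              "data.['x','y','z'].requests": 345 } },
    true  // upsert: create the period document if it doesn't exist yet
)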

So, yes, the answer is that the document will grow, and the total number of elements can vary dramatically from document to document: one may be configured to collect data without any grouping, so it will contain just one "row" per time period, while others will have a high cardinality of results and contain hundreds or possibly thousands of "rows".

Will need to experiment with this, thanks for sharing thoughts, appreciate it.

Scott Hernandez

Feb 19, 2011, 8:36:14 PM
to mongod...@googlegroups.com
With how the padding works, the server will learn that you are growing objects,
but it could take a while until it adjusts to the size, and it will most
likely add a bit of overhead to each document.
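
Depending on the server version, you can watch that adaptation in the collection
stats (illustrative collection name; the field is only there where adaptive
padding is in play):

// paddingFactor starts at 1 and creeps up as the server sees documents grow
db.stats.stats().paddingFactor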

It is definitely worth testing; I expect you will see a lot of changes
within the current time period, and then the data will be pretty much static.

Aleksei T

Feb 19, 2011, 11:07:23 PM
to mongod...@googlegroups.com
OK great, thanks, will experiment with both and post some results here later.  Thanks again!

bingomanatee

Feb 20, 2011, 12:51:55 PM
to mongodb-user
I would suggest you actually build out both methods, generate random data,
and then do time trials on some typical requests. It'd be a cheap
experiment. At the least you could get some data on how big a record can
get (in items) before you approach the 16MB limit.
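
For the size part of that experiment, something like this in the shell would do
(sketch only; the collection name and filter are made up):

// measure how large an approach-2 document actually gets in BSON
var doc = db.stats.findOne({ id: 1, time: 1298159028 })
print(Object.bsonsize(doc) + " bytes out of the 16MB limit")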

I would also point out that the flat solution is really no faster in
Mongo than it would be in SQL. At the risk of getting spammed at, if
you have a flat list of data records with consistent fields, unless
you are interested in the sharding