Prefer more objects or larger objects?

Daniel Harman

Aug 10, 2012, 1:57:05 PM
to mongod...@googlegroups.com
Hi,

I'm writing a chat application and I'm considering having an object per day with all the messages for that day embedded in a list. Alternatively, I could just have the messages in a collection of their own.

e.g. 
chatDay = { messages : [ { Id : ..., Msg : "Hello" }, { Id : ..., Msg : "Oh hi" } ] }

vs

{ Id : ..., Msg : "Hello" }
{ Id : ..., Msg : "Oh hi" }

Obviously the former is going to mean a lot fewer individual objects being pulled back and forth, so I think it might be more efficient for reading and more convenient for paging. Are there any performance considerations here? Would things start to degrade if a 'day' container started to get large due to a huge quantity of messages? (N.B. obviously it's going to go horribly wrong if I hit the object size limit!)
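
To make the two layouts concrete, here is a minimal Python sketch that builds the documents as plain dicts rather than calling a real driver. The `chat_name`, `day`, and `timestamp` fields and both helper names are assumptions made up for the example; only `Id`, `Msg`, and the `messages` list come from the question, and `$push` is MongoDB's operator for appending to an array.

```python
from datetime import datetime, timezone


def day_bucket_update(chat_name, msg_id, text, sent_at):
    """Build (filter, update) documents for the one-document-per-day layout."""
    # Hypothetical bucket key: one document per chat per calendar day.
    filter_doc = {"chat_name": chat_name, "day": sent_at.strftime("%Y-%m-%d")}
    # $push appends the new message to the embedded list.
    update_doc = {
        "$push": {"messages": {"Id": msg_id, "Msg": text, "timestamp": sent_at}}
    }
    return filter_doc, update_doc


def message_document(chat_name, msg_id, text, sent_at):
    """Build a standalone document for the one-document-per-message layout."""
    return {"Id": msg_id, "chat_name": chat_name, "Msg": text, "timestamp": sent_at}


sent_at = datetime(2012, 8, 10, 13, 57, 5, tzinfo=timezone.utc)
bucket_filter, bucket_update = day_bucket_update("lobby", 1, "Hello", sent_at)
single_doc = message_document("lobby", 1, "Hello", sent_at)
```

With a driver, the first pair would typically be sent as an update with upsert enabled (so the day document is created on the first message), and the second document as a plain insert.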

Thanks,

Dan

Octavian Covalschi

Aug 10, 2012, 2:55:06 PM
to mongod...@googlegroups.com
I think this article may answer your question...

Daniel Harman

Aug 10, 2012, 6:53:34 PM
to mongod...@googlegroups.com
Thanks, it seems to confirm my intuition, which is great as I just spent an hour implementing the bucket approach :)

Daniel Harman

Aug 10, 2012, 7:18:29 PM
to mongod...@googlegroups.com
Although, having said that, it doesn't really talk about the performance difference between modifying an existing object (with, for example, a push) and inserting a new object into a collection. Are there any general principles to consider here?

Rob Moore

Aug 11, 2012, 12:52:19 PM
to mongod...@googlegroups.com


On Friday, August 10, 2012 7:18:29 PM UTC-4, Daniel Harman wrote:
Although, having said that, it doesn't really talk about the performance difference between modifying an existing object (with, for example, a push) and inserting a new object into a collection. Are there any general principles to consider here?


An update and an insert will probably be very comparable _if_ the document has not grown beyond the size of the currently allocated block.  If MongoDB does have to move the document, then the insert is going to be faster.

The other issue to consider is that each move/delete of a document leaves a hole.  MongoDB is not very good at managing the holes created if you are not ensuring the documents are of uniform size.  A secondary effect is that after a while the collection of holes in the database slows down all allocations (straight inserts and updates that move) as it scans the growing free lists.

In 2.2, TTL collections will switch the collection to a "power of 2 allocator".  In theory that fixes the fragmentation problem at the expense of one potentially extra index (with a TTL of "forever") and a little wasted space.
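
A small worked example may help show why rounding allocations up to powers of two trades some wasted space for reusable holes. This is only a sketch of the idea, not MongoDB's actual allocator: the function name, the 32-byte floor, and the sample document sizes are all assumptions chosen for illustration.

```python
def power_of_2_size(record_bytes, minimum=32):
    """Round a record size up to the next power of two (assumed 32-byte floor)."""
    size = minimum
    while size < record_bytes:
        size *= 2
    return size


def fits_in_place(old_bytes, new_bytes):
    """True if a document that grew can stay in the slot it was first given."""
    return new_bytes <= power_of_2_size(old_bytes)


# Hypothetical document sizes in bytes.
sizes = [300, 410, 500]
slots = [power_of_2_size(s) for s in sizes]
wasted = [slot - s for slot, s in zip(slots, sizes)]

print(slots)   # [512, 512, 512]
print(wasted)  # [212, 102, 12]
```

All three documents land in same-sized 512-byte slots, so a hole left by one can be reused by any of the others, and under this model a 300-byte document can grow to 500 bytes without moving (`fits_in_place(300, 500)` is true). The cost is the padding listed in `wasted`.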

For me the question is: do you ever plan to delete the documents?  If not, then use a document per message and some smart indexing to group records for faster access.  The data will be packed into memory/disk as tightly as possible.  You will still get temporal/spatial correlation since MongoDB will always append all of the messages to the end of the extents allocated.

If you will delete records, it's a toss-up based on the primary usage pattern, but you will want the TTL collection's power of 2 allocator.

Rob

MKN Web Solutions

Aug 12, 2012, 10:16:37 AM
to mongod...@googlegroups.com
Can anyone from the MongoDB engineering team confirm that this approach is ideal?  I just want to verify that this scheme is being used and works well.

Scott Hernandez

Aug 12, 2012, 10:40:34 AM
to mongod...@googlegroups.com
It depends on a lot of factors (update/move rate, deleting, ratio of
inserts to updates, queries and ordering, etc), but it is a good
approach and one that works well for very active short write loads and
mostly reads, like an activity stream or time-based logging.

Daniel Harman

Aug 12, 2012, 6:44:40 PM
to mongod...@googlegroups.com
Hi Rob,

Thanks for the in-depth answer. It suggests I've now implemented the wrong approach and would be better off going back to an object per message. So I guess I could index by a date field (with no time on it) to get the effect I have now. However, given that the messages are very small (think IRC, not email), I am left wondering if this isn't going to cause a lot of seeking to load up messages by day? They will be temporally correlated of course, but in a table with a whole load of different chats going on they will all be interleaved. Is that something I should be able to ignore?

Alternatively can I force a min block size for a table? I'm not sure it makes sense in terms of disk space consumption but worth considering.

I don't ever plan to delete documents, and it's likely messages will be cached locally on the web server anyway, so perhaps seek time is not a huge concern.

Dan

Rob Moore

Aug 12, 2012, 9:46:29 PM
to mongod...@googlegroups.com


Echoing Scott's comment about there being a lot of variables, but...

If the MongoDB cluster is sized to keep the last N days (hours) in memory then "seeking" isn't an issue except when going back beyond that horizon. 

You can index on the full timestamp and then simply do a range query.  e.g.:
    { timestamp : { $gte : ISODate("2012-08-12T00:00:00Z"), $lt : ISODate("2012-08-13T00:00:00Z") } }
The B-Tree indexes that MongoDB uses are designed to efficiently answer this type of query.

You can also create a compound index on { timestamp : 1, chat_name : 1} and it should speed up a query using both a range on timestamp and a range or value for the chat_name.
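
For anyone building these from application code, here is a minimal Python sketch of the day-range filter and the compound index specification, again as plain data rather than driver calls. The helper name and the "lobby" chat are hypothetical. The lower bound uses `$gte` so that a message stamped exactly at midnight is included.

```python
from datetime import datetime, timedelta, timezone


def day_range_filter(day_start, chat_name=None):
    """Build a filter matching every message in the 24 hours from day_start."""
    day_end = day_start + timedelta(days=1)
    query = {"timestamp": {"$gte": day_start, "$lt": day_end}}
    if chat_name is not None:
        # Optional exact match on the chat, alongside the timestamp range.
        query["chat_name"] = chat_name
    return query


# Index specification as a list of (field, direction) pairs; 1 means ascending.
COMPOUND_INDEX = [("timestamp", 1), ("chat_name", 1)]

query = day_range_filter(datetime(2012, 8, 12, tzinfo=timezone.utc), chat_name="lobby")
```

A driver such as pymongo accepts a filter dict like `query` in `find()` and a key list like `COMPOUND_INDEX` in `create_index()`. Field order in a compound index affects which queries it serves best, so it is worth checking the query plan with `explain()` against real data before settling on one.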

The only option I know of (other than the bucketed documents) to group the messages into chats is to use a collection per "chat", but I'd not recommend that unless you can enumerate the chats beforehand.  I have heard of issues with scaling the collection count into the thousands, but I prefer to just not go there.

The only mechanism I know of for controlling the allocation of blocks in MongoDB is the TTL Collections.  I'm eagerly awaiting the 2.2.0 release so I can take it for a spin on my current project.

Rob.