How best to use Mongo for messages


J

Dec 15, 2011, 10:40:12 AM

to mongodb-user

Hi all, I'm new to Mongo, so I'm trying to figure out good object
design for efficient query performance. I'm going over structures for
a private messaging system as part of a larger app and wanted to hear
any thoughts on their limitations, advantages, flexibility,
variations, etc.

The PM system functions as you would expect and most of the messages
will be reasonably short and between just two users, however I need to
plan for multiple recipients.

A large proportion of the messages will be associated with a relevant
'document' object and referenced within that document as a link but
not displayed directly. When that document is archived the associated
messages are also archived, probably by embedding them into the
document itself. This is likely to happen after they have dropped off
the user's inbox, which will have an auto-discard timeframe unless
specific messages are saved by the user. Saved messages would be
stored away from the messages collection, maybe embedded in their
profile object.

The users will have an inbox (unread count), sent, drafts (embedded in
profile until sent), trash (10 day deletion)

Here's some ideas on its structure. Any advice would be appreciated.

Structure 1: the message stream is an object and messages are embedded
in the stream. Logic for archiving, status and deletion from a user's
inbox without affecting others seems fairly straightforward, and I'm
thinking a $push on messages and status would be OK? Can bolt on
attachment links easily enough if needed in the future.

{
  _id: 'object id', // used as stream id
  involved: [ 'user1', 'user2', 'user3' ],
  subject: 'that thing',
  association: 'item1',
  // status values could just be numbers
  status: [ { user1: 'unread-1' }, { user2: 'read' }, { user3: 'unread-2' } ],
  // timestamps are used to highlight unread messages in the stream
  timestamps: [ { user1: 'ts' }, { user2: 'ts' }, { user3: 'ts' } ],
  messages: [ {
    sender: 'user1',
    recipient: [ 'user2', 'user3' ],
    timestamp: 'timestamp',
    message: 'users message'
  }, {
    sender: 'user2',
    recipient: [ 'user1', 'user3' ],
    timestamp: 'timestamp',
    message: 'users message'
  } ]
}
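
A minimal sketch of the $push idea for structure 1, written in plain JavaScript so it runs without a server. `buildReplyUpdate` and `applyUpdate` are hypothetical helpers: the first builds an update document, the second only imitates `$push` and a two-level dotted `$set` on an in-memory object. It also assumes the per-user status is kept as a map keyed by user name instead of the array of one-key objects above, so that one user's status can be set with a single dotted path.

```javascript
// Hypothetical helper: build the update document for one reply in a stream.
function buildReplyUpdate(sender, recipients, text, now) {
  const set = {};
  for (const user of recipients) {
    set["status." + user] = "unread"; // dotted path targets one user's status
  }
  return {
    $push: { messages: { sender, recipient: recipients, timestamp: now, message: text } },
    $set: set,
  };
}

// Toy stand-in for the server: handles only $push and "outer.inner" $set paths.
function applyUpdate(doc, update) {
  for (const [field, value] of Object.entries(update.$push || {})) {
    (doc[field] = doc[field] || []).push(value);
  }
  for (const [path, value] of Object.entries(update.$set || {})) {
    const [outer, inner] = path.split(".");
    (doc[outer] = doc[outer] || {})[inner] = value;
  }
  return doc;
}

// Assumed shape: status as a map rather than an array of one-key objects.
const stream = {
  _id: "stream1",
  involved: ["user1", "user2", "user3"],
  status: { user1: "read", user2: "read", user3: "read" },
  messages: [],
};

const update = buildReplyUpdate("user1", ["user2", "user3"], "users message", 1323945612);
applyUpdate(stream, update);
console.log(JSON.stringify(update));
```

Against a real server, the same update document could be passed as the second argument of the shell's `update` call on whatever collection holds the streams. Note that every reply grows the stream document, which is the point the replies below pick up on.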

Structure 2: each message is an object. More traditional approach and
maybe more flexible for queries and indexing? Results in many more
objects though. Not sure how the extra redundancy of info would affect
reads/writes as the collection grows. Would probably need a cache for
efficient 'dashboarding' of the inbox. Doesn't feel like it's using
Mongo's strengths.

{
  _id: 'object id',
  stream_id: 'shared id',
  association: 'item1',
  sender: 'user1',
  recipient: [ 'user2', 'user3' ],
  timestamp: 'timestamp',
  subject: 'that thing',
  message: 'users message',
  // no mention of user1, so the app assumes it as read
  status: [ { user3: 'unread' }, { user2: 'discarded timestamp' } ]
}

Structure 3: recipient is the object. The number of objects will never
be more than the userbase. Content is duplicated for each recipient. I
can imagine the queries being returned pretty sharpish, but writes may
not be so great. Grouping all the messages for the associated document
may be tricky without a tracking collection.

{
  _id: 'user1',
  last_viewed: 'timestamp',
  messages: [ {
    _id: 'object id',
    stream_id: 'shared id',
    association: 'item2',
    sender: 'user3',
    recipient: [ 'user1', 'user2' ],
    timestamp: 'timestamp',
    subject: 'that thing',
    message: 'users message',
    status: 'unread'
  }, {
    // ..... you get the idea
  } ]
}

The main thing I'm not sure about is the effectiveness of indexing and
queries, but any input would be great.

thanks

Brandon Diamond

Dec 15, 2011, 12:25:26 PM

to mongodb-user

I think your second design makes the most sense. Here's why:

MongoDB performs updates "in-place". When this is possible, updates
complete very quickly as they amount to a few bits being flipped in
primary memory. However, when an update grows a document, it may be
necessary to move the entire document to make room for the new data.

Generally, if growth is bounded by a small factor, MongoDB is able to
automatically compute a padding factor that minimizes the amount of
copying and provides for solid performance. However, if you're
updating a single document every time a message is sent, your updates
are destined to be quite slow.

In general, smaller documents with non-growing updates are to be
preferred. If you expect most conversations to be of the same length,
you can attempt to preallocate your documents -- but this will likely
lead to much wasted space.
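
To make the in-place versus move distinction concrete, here is a toy model in plain JavaScript. It is not MongoDB's real allocator: `simulateAppends`, the byte sizes and the fixed padding factors are all assumptions chosen for illustration. A record is given `paddingFactor` times its current size whenever it is (re)allocated; an append that still fits counts as in place, and one that does not counts as a move.

```javascript
// Toy model only: NOT MongoDB's real allocator.
// A record gets `paddingFactor` times its current size when (re)allocated.
// An append that fits in the allocation is "in place"; otherwise the record "moves".
function simulateAppends(initialSize, appendSize, appends, paddingFactor) {
  let size = initialSize;
  let allocated = Math.ceil(size * paddingFactor);
  let moves = 0;
  for (let i = 0; i < appends; i++) {
    size += appendSize;
    if (size > allocated) {
      moves += 1; // whole record copied to a new, larger slot
      allocated = Math.ceil(size * paddingFactor);
    }
  }
  return { size, allocated, moves };
}

// One stream document growing by 200 bytes per message, 50 messages.
const noPadding = simulateAppends(500, 200, 50, 1.0);
const padded = simulateAppends(500, 200, 50, 1.5);
console.log(noPadding.moves, padded.moves);
```

In this model padding cuts the number of moves sharply, but the padded record also ends up holding space it is not using, which is the wasted-space trade-off mentioned above.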

Hope this helps,
- Brandon

gregorinator

Dec 15, 2011, 3:31:27 PM

to mongod...@googlegroups.com

As Brandon has already pointed out, MongoDB isn't optimal when
documents frequently grow in size, especially when the growth is
unpredictable -- i.e., some documents grow a lot, while others grow
little. The example that's usually cited is a blog where comments
pour in by the tens of thousands, and some posts can get thousands of
comments while others may have none. In this case, comments should be
their own collection.

But that doesn't mean you must design your database so that documents
never grow. In the phrase "documents frequently grow", "frequently"
is a relative thing. When the volume of document shuffling is
manageable by the hardware, and especially when there's a high ratio
of reads to writes, it can also be important to take read patterns
into account when designing the database.

It doesn't sound to me as though your application will necessarily
have a massive volume of message creation -- though only you know for
sure. In the end, you have to apply your knowledge of how the
application is going to be used: If there will be many and frequent
message writes, I wholeheartedly agree with Brandon that you should
look at option 2. On the other hand, if the number of new messages is
manageable, and you want to optimize reads... then we should look at
your read requirements. It seems from your description that they are:

A. Populate the user's Inbox (and other message boxes)
B. Read messages
C. Disposition messages
D. Automatically discard old messages
E. Automatically archive messages and documents

To keep things simple, let's assume that D and E are batch processes
that can be done during quiet time, so we don't need to optimize for
those. Let's look at how each of your proposed designs handles A, B,
and C:

Option 1 is going to have the most trouble with A, populating the
user's message boxes, because it will have to do the most rummaging
around inside the document contents. MongoDB is fastest when it can
retrieve whole documents, so the optimum is to have each document
self-contained, and containing only what you need.

Option 2 will find it fairly easy and fast to meet requirement A if
you create an index on the user ids.
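
As a rough illustration of why an index on the user ids helps option 2, the sketch below emulates a multikey index in plain JavaScript: one index entry per element of each message's `recipient` array. On a real server this would be an index on the `recipient` field (`ensureIndex` in shells of this thread's era, `createIndex` in current ones). The helper names, the sample data and the map-shaped `status` field are assumptions, not part of the original design.

```javascript
// Toy multikey index: one entry per element of each message's `recipient` array.
function buildRecipientIndex(messages) {
  const index = new Map();
  for (const msg of messages) {
    for (const user of msg.recipient) {
      if (!index.has(user)) index.set(user, []);
      index.get(user).push(msg);
    }
  }
  return index;
}

// Inbox for one user: look up by recipient, newest first.
function inboxFor(index, user) {
  return (index.get(user) || []).slice().sort((a, b) => b.timestamp - a.timestamp);
}

// Unread count: as in structure 2, a user absent from `status` is treated as read.
function unreadCount(inbox, user) {
  return inbox.filter((m) => m.status[user] === "unread").length;
}

// Hypothetical sample data; `status` is assumed to be a map keyed by user.
const messages = [
  { _id: 1, stream_id: "s1", sender: "user1", recipient: ["user2", "user3"], timestamp: 100, status: { user3: "unread" } },
  { _id: 2, stream_id: "s1", sender: "user2", recipient: ["user1", "user3"], timestamp: 200, status: { user1: "unread", user3: "unread" } },
  { _id: 3, stream_id: "s2", sender: "user3", recipient: ["user1"], timestamp: 150, status: {} },
];

const index = buildRecipientIndex(messages);
const user1Inbox = inboxFor(index, "user1");
console.log(user1Inbox.map((m) => m._id), unreadCount(user1Inbox, "user1"));
```

The lookup touches only the messages addressed to the user, which is what the index buys; without it, populating an inbox means scanning every message.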

Option 3 makes it easy to meet requirement A because all you have to
do is read the entire document for the subject user. You can filter
the messages on the client side to populate all the message boxes --
Inbox, Read, etc. -- in one fell swoop.
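
The client-side filtering for option 3 could look something like the sketch below: fetch the one document for the user, then bucket its embedded messages by status. `groupByStatus`, the sample document and the status values `unread`, `read` and `trash` are assumptions for illustration.

```javascript
// Group one user's embedded messages into boxes keyed by their status.
function groupByStatus(userDoc) {
  const boxes = {};
  for (const msg of userDoc.messages) {
    (boxes[msg.status] = boxes[msg.status] || []).push(msg);
  }
  return boxes;
}

// Hypothetical user document in the shape of structure 3.
const userDoc = {
  _id: "user1",
  last_viewed: 150,
  messages: [
    { _id: "m1", sender: "user3", subject: "that thing", timestamp: 100, status: "read" },
    { _id: "m2", sender: "user2", subject: "that thing", timestamp: 200, status: "unread" },
    { _id: "m3", sender: "user3", subject: "other thing", timestamp: 210, status: "unread" },
    { _id: "m4", sender: "user2", subject: "old thing", timestamp: 50, status: "trash" },
  ],
};

const boxes = groupByStatus(userDoc);
const unread = (boxes.unread || []).length; // drives the inbox unread counter
console.log(Object.keys(boxes), unread);
```

One read gives every box and the unread count at once, at the cost of transferring the whole document even when only one box is shown.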

What's going to be optimum for requirements B and C depends on the
answer to a crucial question: When a user wants to read a message, do
they want to see only that one message? Or do they want to see the
entire thread that contains that message, so they see the message in
full context?

If the user only wants to see that individual message, then you might
want to lean toward your option 2. Again, indexing on user id. You
can read the collection using the index to populate the message lists.
If you want, you can retrieve the message contents at the same time,
caching them on the client and displaying them when the user clicks on
a message.  Or you can wait until the user clicks on a message to hit
the collection and retrieve the message.  Either way, you're retrieving
an entire message document -- Mongo's fastest way.

On the other hand, if the user wants to see the entire message thread
each time he or she reads a message, then I find it especially elegant
to consider a combination of your options 1 and 3. Each message
thread would be a document in a thread collection, as in your option
1, but without the user information. A separate option 3 collection
would contain a document for each user with the information relating
the user to each of his or her messages. You would populate the
user's message boxes on the client by retrieving the entire document
for the user from collection 3 (whole document = fast), and filtering
on the client to display Inbox, Read, etc. Then, when the user
selects a document, you would retrieve that entire message thread from
collection 1 (another whole document = fast) and display it.
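
A small sketch of the combined design, using two in-memory maps as stand-ins for the two collections. All names here (`threads`, `userBoxes`, `thread_id`, `last_seen`, `listThreads`, `openThread`) are hypothetical; the point is only that listing a user's boxes and opening a thread are each a single whole-document read by `_id`.

```javascript
// Two toy "collections", each keyed by _id.
// `threads` follows option 1 without the per-user fields.
const threads = new Map([
  ["t1", {
    _id: "t1",
    subject: "that thing",
    association: "item1",
    messages: [
      { sender: "user1", timestamp: 100, message: "first" },
      { sender: "user2", timestamp: 200, message: "reply" },
    ],
  }],
]);

// `userBoxes` follows option 3 but stores references to threads, not message bodies.
const userBoxes = new Map([
  ["user2", {
    _id: "user2",
    threads: [{ thread_id: "t1", subject: "that thing", status: "unread", last_seen: 100 }],
  }],
]);

// Step 1: one whole-document read to list a user's threads.
function listThreads(userId) {
  const doc = userBoxes.get(userId);
  return doc ? doc.threads : [];
}

// Step 2: one whole-document read when the user opens a thread, then mark it read.
function openThread(userId, threadId) {
  const thread = threads.get(threadId);
  const entry = listThreads(userId).find((t) => t.thread_id === threadId);
  if (!thread || !entry) return null;
  entry.status = "read";
  entry.last_seen = thread.messages[thread.messages.length - 1].timestamp;
  return thread;
}

const opened = openThread("user2", "t1");
console.log(opened.messages.length, listThreads("user2")[0].status);
```

Keeping the per-user state out of the thread document means marking a thread as read never touches the shared thread, so one user's actions cannot affect what another user sees.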

Unlike RDBMSs, where database design is almost mathematical, designing
a document database requires, first, knowledge of how the data is
going to be written and subsequently used, and, second, judgement.

I hope this has been some help,
gs

P.S. One more wrinkle regarding MongoDB performance and documents
that continually grow: MongoDB documents do not shrink when some of
their contents are deleted.  That means that a document's contents can
be frequently added to without a performance hit if it's also being
frequently deleted from, and the adds and deletes roughly offset each
other. Consider, for example, your option 3: As messages are added
for a user, the document will grow and Mongo will have to move it
around on disk. But at some point, messages will start to be deleted
or archived, and the space that that frees up will be available to be
reused by the next messages that are written. At some point,
presumably, a high-water mark will be reached, with equilibrium
between adds and deletes, and the document will cease to grow, so then
there won't be any more performance penalty.
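
The high-water-mark argument can be sketched with another toy model. It assumes, as described above, that a document's allocation grows when needed and never shrinks; it is not MongoDB's actual storage code, and `simulateMailbox` with its parameters is hypothetical. Each step adds messages to a user's document, and the oldest are discarded once a cap is reached.

```javascript
// Toy model only: allocation grows when needed and never shrinks.
// Each step adds `addPerStep` messages; beyond `maxMessages` the oldest are discarded.
function simulateMailbox(steps, addPerStep, maxMessages, messageSize) {
  let count = 0;
  let allocated = 0;
  let moves = 0;
  let lastMoveStep = -1;
  for (let step = 0; step < steps; step++) {
    count = Math.min(count + addPerStep, maxMessages);
    const used = count * messageSize;
    if (used > allocated) {
      allocated = used; // document outgrew its slot and has to move
      moves += 1;
      lastMoveStep = step;
    }
  }
  return { allocated, moves, lastMoveStep };
}

// 100 steps, 2 new messages per step, at most 20 kept, 300 bytes each.
const mailbox = simulateMailbox(100, 2, 20, 300);
console.log(mailbox);
```

In this model all the moves happen while the mailbox is filling up; once adds and discards balance, the allocation stays at its high-water mark and later steps cause no further moves.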

J

Dec 15, 2011, 3:36:41 PM

to mongodb-user

That does change my thinking a little. I'm glad I found this out now;
it would have caused headaches later on, and not just with the
messaging!

This does kind of contradict an insertion example in the Mongo manual
though, where it shows comments being embedded under a blog post.
That's kinda what contributed to structure 1 above. Surely embedding
comments in a post will have the same effect? I guess the write hit is
acceptable for the performance gain of embedding, since the post and
comments will, in most cases, be read many more times than commented
on.

There should be a little note under that example explaining how
insertions and updates are affected by growing objects. I don't think
the padding factor section really makes this point that clear either.

thanks for the reply

J

Dec 16, 2011, 8:16:06 AM

to mongodb-user

Thanks for the detailed explanation Greg, it really helps. You must
have posted while I was typing my previous message.

I like the 1 and 3 combo. It would still have the 'growing' issue on
thread objects, but that's outweighed by slick reads and the grouping
of related messages. Going to have to think about how a user's inbox
will be used and abused.

It's an interesting point about an object's size reaching a certain
point and maintaining equilibrium in regards to option 3.

Would a document being moved (as in option 1) have much of a
performance difference to inserting a new object altogether (option
2), considering option 2 will carry a bit more info as well?

