Re: [mongodb-user] Creating a Facebook-like "Wall" in mongoDB - help in structuring it


Sam Millman

Jul 1, 2012, 6:04:29 PM
to mongod...@googlegroups.com
Joins are just done in place, or all the data is grabbed at once. The only real difference is that SQL does server-side joins while in MongoDB you do client-side joins.

Personally I hold one collection, big-table style. It holds activities; each activity has a type (which produces a type of caption), an object id and custom text (I got this structure from using the fb API, this is the exact data they take in).

As I scroll through the records I collect the data via big range queries and then sort it out into the right documents client side; couldn't be simpler.

Now this isn't a notifications section but more of an activity stream, right? Which means you won't need to seed a message per friend when activity connected to them occurs from another user? If that is the case then yes, one collection called "activity" or whatever will fulfill your needs.

I mean other people have other ideas on how to build the schema for such a thing, but that is a method I have proven to work well and at the same time be scalable and easily sharded.

On 1 July 2012 22:20, Eric Mathiasson <e.math...@gmail.com> wrote:
Hi!
I'm thinking of creating a Facebook-like "Wall" in a project of mine.
Basically, you as a user would have friends, and the main wall would show all wall posts (all activity) done by your friends.
A user can comment and "like" a wall post done by other users.

I have a few ideas of how to do it in mongo, but I'm fairly new to it, and thus, don't know all possibilities.
The "big problem" that I see - is that you cannot really do a RDMBS join in mongo, and thus, need to design the system in a different way.
So I'm was just wondering if you could share some ideas of how to go by with it?
I'm thinking in the ways of; "How many collections", "What to store in each collection", "Embedding or linking between collections" etc.

All tips and help are welcome, I need to do it "right" from the start, so it's best for me to get your advice first, I thought :-)

Regards,
E.


Sam Millman

Jul 1, 2012, 6:06:42 PM
to mongod...@googlegroups.com
As an edit:

Of course, recently fb has gone towards storing as much as possible within the activity row itself, which means it never does a join to get extra data...

Sam Millman

Jul 1, 2012, 6:10:37 PM
to mongod...@googlegroups.com
As a further edit:

That is of course different for wall posts.

Wall posts will require joins to get the comments and the shares and all that, but if they are smart (which I suspect they are) they will have a pre-aggregated field for the share count and so on, and only ever grab comments via a join, unless they query those separately too.

In Mongo of course you could house the comments in a subdocument, since there shouldn't be more than 8k of those comments (hopefully) and they should be quite short, so in your collection you could have two types of activity:

- activity
- and wall post
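
For the wall post type, roughly something like this (just a sketch, the field names are made up for illustration):

{
   _id: ObjectId(),
   type: 'wall_post',
   user_id: ObjectId(),     // the poster
   text: 'Off to the beach!',
   likes: 12,               // pre-aggregated counter
   shares: 3,               // pre-aggregated counter
   comments: [
      { user_id: ObjectId(), text: 'Have fun!', created: new Date() }
      // kept short and bounded so the document stays well under the 16mb limit
   ]
}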

Eric Mathiasson

Jul 4, 2012, 4:55:51 AM
to mongod...@googlegroups.com
Edit:
Or do you mean that I scroll through the entire big table which holds ALL activities, and look for those that belong only to my friends?
That sounds kind of inefficient... imagine if I have 1M members on the site and need to go through millions of wall posts to get my friends' posts?

This has to be done in real-time as well, so it cannot take longer than ~0.5s to render the main wall.

Sam Millman

Jul 4, 2012, 5:21:35 AM
to mongod...@googlegroups.com
Ok well the structure I use for each row is (a simplified version, very simplified):

{
   _id: {},
   title: 'has uploaded a new video',
   user_id: {},
   picture: 'http://sximg.com/thumb_556665.png',
   caption: 'A video of me',
   link: 'http://videos.stagex.co.uk/watch?id=59596',
   text: '',
   location: [], // Would be long/lat
   with: [] // Would be an array of ObjectIds of users
   // etc
}

Since Mongo is so good at streaming fresh results from the DB without having to try and concat everything into a join, let's ignore the joins that this doc would create.

Our first query would be to, say, get a single user's stream. So I would just query on user_id with a limit of 20 docs initially, and then use the last _id you saw to get the next 20 records when the user needs them.

Since Mongo is good at streaming (as I said), once we have these 20 we can just do the joins across the collections in place to get our info.

To get all, say, subscription wall posts for a user (everyone they are following), you would query the following relation on the user to get back a list of user_ids that the user is following, then you'd do an $in query on the user_id of those documents.
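
In shell terms that would be something along these lines (purely illustrative collection and field names):

// First page of a single user's stream:
db.activity.find({ user_id: userId }).sort({ _id: -1 }).limit(20);
// Next page, keyed off the last _id seen on the previous page:
db.activity.find({ user_id: userId, _id: { $lt: lastSeenId } }).sort({ _id: -1 }).limit(20);

// Subscription stream: grab the followed ids, then $in on user_id:
var followedIds = db.following.find({ follower_id: userId }).map(function (f) { return f.followed_id; });
db.activity.find({ user_id: { $in: followedIds } }).sort({ _id: -1 }).limit(20);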

That is personally how I would do it. As I said before you can sort stuff client side etc but I am not doing that here.


"Or do you mean that I scroll through the entire big table which holds ALL activities? And look for those that belongs only to my friends?
That soulds kind of in-efficient.. imagine if I got 1M members on the site and need to go through millions of wall posts to get my friends?"

How else were you thinking of doing it? I mean if you use indexes you should be able to get through it quite quickly. I suppose you could split the collection down, but then you're getting into the complexities of manual sharding, which is just overkill. I suppose you could use a subdocument structure here, but then you're quickly looking at document size limitations (I ran into them too). Even in SQL you would be forced to house a table called "activities" and query on that.


"This has to be done in real-time as well, so it cannot take longer than ~0.5s to render the main wall."

I don't think anyone is able to do that in real time for huge collections, maybe for up to 100m rows, but past that... you would have to use some HTML caching structure at that size. You would sanely only show maybe the latest 20 activity posts for a user and refresh that list every 5 mins.

Saar Korren

Jul 4, 2012, 7:34:36 AM
to mongod...@googlegroups.com
Okay, first, let's try to refine the question a bit, since you weren't specific enough in what you're trying to accomplish.

Now, "Wall" might refer to two things:
-The user's own status and posts and activities, plus maybe comments.
-The "News feed" which contains activities of friends and subscribed channels.

Now, the former is actually trivial and shouldn't require any joins. You can hold an entire post, of any type, along with all related information (comments, pictures, etc.) in a single document. If you're expecting a lot of comments on a single post, you can split those into a second collection, with an index over a field which references the "parent post" being commented on. You can then query for those when displaying the posts.
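
A minimal sketch of the split-out case (collection and field names are hypothetical):

// Index on the parent post so comment lookups don't scan the collection:
db.comments.ensureIndex({ post_id: 1, created: 1 });
// Fetch the comments when displaying a post:
db.comments.find({ post_id: post._id }).sort({ created: 1 });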

The latter, however, can be accomplished in two ways, which I should address separately:

-On demand
You iterate through your friends and subscribed channels, deriving recent news from each, and then aggregating them into a single list. You can get some optimization on that by using existing results to minimize the maximal date for subsequent queries. However, as you've probably noted, this can get VERY inefficient if your subscription count becomes large.

(Small note: You'd be getting similar issues with this method if you've used MySQL JOIN. It's just not efficient)

-Pre-aggregate on-update
You keep a collection for holding your "News feed". Whenever a post is made by a subscribable (Yes, I just made up a word for it), a copy of the post is written to the feeds of all subscribed users. If you don't like the duplicate information, you can simply write dates and ids which reference the original posts.
This requires holding an index by subscribable on the subscription collection, naturally.
Also, while this does create an additional load during posting (Since the subscribers are iterated), this is still significantly less compared to iterating the posts of multiple users on each read.

If you choose to hold reference ids instead of entire posts, you will need an additional query to retrieve the post on-display. This is similar to a join. Being done per-id, this should not add any significant load. In case where posts may be (permanently) deleted by the poster, you can do on-demand cleanup of the aggregated collection during the query (And that's something a join would not do for you).

This method also adds the possibility of removing the item from a single user's news feed (e.g. The "hide" button in Facebook).
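
A rough sketch of that fan-out, assuming hypothetical "subscriptions" and "feeds" collections and storing references rather than full copies:

// When a post is made, write one feed entry per subscriber:
db.subscriptions.find({ subscribable_id: posterId }).forEach(function (sub) {
   db.feeds.insert({
      subscriber_id: sub.subscriber_id,
      post_id: newPostId,      // reference to the original post
      created: new Date()
   });
});

Hiding a post for one user is then just a remove() on that user's feed entry.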

-Hybrid system and lazy aggregation

Where there are two methods, you can usually get a "best of both" solution. You can create a collection for "News feed"s, but you don't necessarily need to update it when a post is made.
You can use it as a cache, and update it on-demand, drawing only the new posts by each subscribable, using the existing posts in the cache as a "minimal id" constraint on the query. This makes writes fast, and optimizes reads by ensuring the load caused by each post is applied only once.

Instead of updating the News feed on-demand, or on-write, you can also update it in a background task. This does mean that a post made would not instantly appear on the feed, but it also causes the least load and delays for both subscriber and subscribable. If there are many subscriptions for each subscriber and each subscribable, then this is the best method.
You would need to create an additional collection to act as a "job queue", receiving inserts for every post made. It would slowly iterate through its job queue, updating the various News feeds as necessary.
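
One possible shape for that job queue, with findAndModify used so a worker claims each job exactly once (again, just a sketch with made-up names):

// On post: record the job only, nothing else.
db.feed_jobs.insert({ post_id: newPostId, poster_id: posterId, done: false });

// Background worker loop: claim the oldest unprocessed job...
var job = db.feed_jobs.findAndModify({
   query: { done: false },
   sort: { _id: 1 },
   update: { $set: { done: true } }
});
// ...then fan it out to the subscribers' feeds as in the earlier sketch.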

Please note one thing: the distinction between all of these methods would still exist in SQL. It would simply be required to achieve performance thresholds, since relying blindly on joins can easily bring your server to a crawl.

Sam Millman

Jul 4, 2012, 7:50:55 AM
to mongod...@googlegroups.com
"You keep a collection for holding your "News feed". Whenever a post is made by a subscribable (Yes, I just made up a word for it), a copy of the post is written to the feeds of all subscribed users. If you don't like the duplicate information, you can simply write dates and ids which reference the original posts.
This requires holding an index by subscribable on the subscription collection, naturally."

That's efficient for notifications but for wall posts that is a highly inefficient use of space and querying....

I mean if you're worried about querying over millions of rows then what about billions, because duplicating the post for every combination of friend and attached user would quickly expand to that number... That is the reason I disagree with that approach: your spending will spiral out of control, and not only that, but your system will be taking so much load from writing that reads will become strenuous at best.

I have heard other people suggest this approach and I agree it works for a notifications system but for a fb wall....


Sam Millman

Jul 4, 2012, 7:56:01 AM
to mongod...@googlegroups.com
As an example take the Top Gear fb page.

Every time they post a status to their page you would be duplicating it to 5m subscribers?? That's a LOT of writes, and already your aggregated activities collection (or whatever collection you house this in, even as a subdocument) will become very large and the locks on it very hairy. Consider they can pump out 15 status updates a day..... lol, I think you can crunch those numbers....

Sam Millman

Jul 4, 2012, 8:01:46 AM
to mongod...@googlegroups.com
"-Hybrid system and lazy aggregation"

I guess this is just a lazy version of the method you mentioned before? Again I see the same troubles with it as with the method above. It's just lazier, which means load on the server can be managed... just, locks would still be questionable.

I mean imagine having 100 Top Gears all pumping out 15 statuses a day, and then all the rest of your users with an average of 200 friends.... you have close to 4b rows a day on average (when I tested under my scenario).... that's why I went for the first method, it's the only one that made sense to me.

Saar Korren

Jul 5, 2012, 6:08:36 AM
to mongod...@googlegroups.com
I think you have some sort of misunderstanding about the difference between "querying" over a billion rows, and "iterating" over a billion rows.

If you have a row for every post<->subscriber pair, and you have an index over the subscriber, and you query for posts the subscriber should see, the query would simply use the index, and only go over the posts to which the user is subscribed. In other words, it would NOT go over billions of rows. In fact, if your index uses the post date, descending, as a second key, it would only go over the posts returned. In other words, instead of going over millions of posts, you go over about 20 or 50 (depending on how many posts you show per page).
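
Roughly, in index terms (names illustrative):

db.feeds.ensureIndex({ subscriber_id: 1, created: -1 });
// Reading a page then only walks the first 20 index entries for that subscriber:
db.feeds.find({ subscriber_id: userId }).sort({ created: -1 }).limit(20);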

Now, if you didn't have that collection, and tried to compute things on-demand, you would have to go over the latest 20 or 50 posts per-subscribable, so if you have 5000 subscriptions then you are now iterating over 250000 posts for every page view!

This is the same if you used MySQL. Doing the query using a JOIN would result in a query that iterates 250000 posts (While locking both the subscriptions and posts tables), if not more, resulting in a query that runs VERY slow, and can easily take anywhere between several seconds to several minutes (!) to complete.

On the topic of locks, you seem to have mentioned them several times, which shows me that you have some sort of misconception as to when and for how long collections are locked. The method I described actually raises practically no cause for mutual exclusion. Querying is performed very fast, and the read-locks held don't even block each other. The write lock used to update the helper collection is only held sporadically - for each row added, so it does not block queries and other operations either.
This is a feature of MongoDB that is in contrast to MySQL which would, in fact, lock the whole table for any such query. However, even in MySQL, a lock held on two tables during a JOIN would create MUCH greater exclusion compared to ones held sporadically during fast queries and insertions.

The fact is, you don't need to manage any special locking here. The queries are very simple. A query by index for read. Simple row-by-row insertion for write (Doesn't even read any data).

The REAL cost, as you have hinted, is that the helper collection would be quite large. This is a space vs performance issue. You either pay the cost in space, or you have a site that constantly freezes trying to aggregate on-demand with no caching.
You can partially alleviate this by having the aggregated collection only hold references to the real posts, rather than actual content. This should reduce individual document sizes.
And don't forget that you can shard the collection by subscriber, and thereby have it stored in several data-centers, reducing the space (and CPU) load on each.
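
Sharding by subscriber would be along these lines (hypothetical db/collection names):

sh.enableSharding("mysite");
// Each user's feed entries then live together on one shard:
sh.shardCollection("mysite.feeds", { subscriber_id: 1 });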

As for the number crunching: You seem to be very worried about the writes throughput, but have you considered reads? Imagine how often a user refreshes his wall. This is a lot more than the number of status posts made, more so if you consider things like AJAX queries to update new posts in real time (once a minute vs 15 a day? No contest). Imagine if, each of these times, the number of posts displayed (again, about 20-50 in a normal page) is read from EACH of the feeds to which the user is subscribed.
Think of that top gear page you mentioned: Instead of writing 1 post 15 times a day for each subscriber, it now has 20-50 posts read from it hundreds, if not thousands of times by each subscriber per day.
You can argue about the efficiency of reads vs writes, but that difference is marginal when faced with such a huge difference in quantity. You'd basically be driving your CPU to the ground.

Now, admittedly, you can optimize the reads a bit by more carefully iterating through each subscribed feed, doing a form of sorted array merge. You would still be left with a LOT more rows read than were written in the caching method, per day.

Incidentally: You CAN save some space by capping the maximal length of the news feed per user. That is, you can decide that a user cannot view news past a certain page, or that are older than a certain date. Again: This saves space, not performance.

Sam Millman

Jul 5, 2012, 6:34:24 AM
to mongod...@googlegroups.com
"If you have a row for every post<->subscriber pair, and you have an index over the subscriber, and you query for posts the subscriber should see, the query would simply use the index, and only go over the posts to which the user is subscribed. In other words, it would NOT go over billions of rows. In fact, if your index uses the post date, descending, as a second key, it would only go over the posts returned. In other words, instead of going over millions of posts, you go over about 20 or 50 (depending on how many posts you show per page)."

Indeed, but the original poster didn't like the idea of querying over millions when I put it to him either, because the collection might have millions of rows, so I said "what about billions?"

Fact is, using an index, billions should be fine so long as you are not iterating over those billions.


"Now, if you didn't have that collection, and tried to compute things on-demand, you would have to go over the latest 20 or 50 posts per-subscribable, so if you have 5000 subscriptions then you are now iterating over 250000 posts for every page view!"

You would just use an index to only get the latest 20 via a set of subscriber ids. It is no different than querying by one user_id because you would have the same number of docs in the final result; fair enough, nscanned could be high, but there are ways to solve that.


"Querying is performed very fast, and the read-locks held don't even block each other. The write lock used to update the helper collection is only held sporadically - for each row added, so it does not block queries and other operations either."

Not on a cluster that only supports a low number of simultaneous connections.


"This is a feature of MongoDB that is in contrast to MySQL which would, in fact, lock the whole table for any such query. However, even in MySQL, a lock held on two tables during a JOIN would create MUCH greater exclusion compared to ones held sporadically during fast queries and insertions."

Doesn't Mongo work on global locks, and in 2.2 DB-level locks, for writes etc.? Fair enough, your theory would be correct in maybe 2.3 or later, whenever 10gen decide to implement collection-level locking.


"You seem to be very worried about the writes throughput, but have you considered reads? Imagine how often a user refreshes his wall."

Consider that fb and twitter only poll for new information every 3 mins. You also gotta consider that in 90% of cases, as I said, it can be done via a simple $in on subscriber_id, making the query extremely fast. Fair enough, a little slower than a single user_id, but with a time range added you can severely lower the risk of nscanned being too high (think of the fb timeline..).
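
i.e. something like this, with a compound index on { user_id: 1, created: -1 } behind it and followedIds being the list of ids the user follows (illustrative only):

db.activity.find({
   user_id: { $in: followedIds },
   created: { $gte: new Date(Date.now() - 7 * 24 * 3600 * 1000) }  // e.g. last week only
}).sort({ created: -1 }).limit(20);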


"Think of that top gear page you mentioned: Instead of writing 1 post 15 times a day for each subscriber, it now has 20-50 posts read from it hundreds, if not thousands of times by each subscriber per day."

That's why sites use memcache.


"Incidentally: You CAN save some space by capping the maximal length of the news feed per user. That is, you can decide that a user cannot view news past a certain page, or that are older than a certain date. Again: This saves space, not performance."

I would like to think this isn't even a consideration; fb have now made it so you can see right back to the days of Facemash, so I think we gotta consider that a news feed should be untethered.

So you're saying your cluster could easily handle 5b writes in one go? That must be a pretty big cluster... I mean you can ease the pressure of reading via cache; it is harder to ease the pressure of writes since you must write at some point, you have no choice, and the data is not there until you write...



Sam Millman

Jul 5, 2012, 6:56:54 AM
to mongod...@googlegroups.com
I also think that a 5b write operation would not yield to other ops as easily as you claim.

It still creates a work queue on the DB end pushing that amount of ops, and considering the locking at the moment, operations could still become "clogged". I mean, if you were to define the hardware upon which you think this would run while still releasing locks in time for other ops, so as not to create a slowdown for the end user, that would help.

Also you gotta consider that if you pause writes to free up resources and get better locking, then you are immediately at risk of losing that data in an area outside of the journal's or DB's control. I don't personally like the sound of that; also, where would you house such a cache for writing?

I mean the other alternative of course is to house a job table (as you said in your reply), but then I just realised: how are you going to keep track of that job? Maybe through subscriber user_id ranges? So you say take from user_id 4 with a limit of 20, store the last id and process again until all subscribers are completed. This means you are iterating over 5m records to complete the writes of 5m records. So you are getting the worst of both worlds there if you try and delay writes. In fact, to write to each wall you would need to know the wall's owner in relation to the subscription (like Top Gear), as such you would be iterating over 5m records to understand whether or not to write 5m records.

It is questions like this that make me say that reads are easier to handle than writes.

Sam Millman

Jul 5, 2012, 7:02:26 AM
to mongod...@googlegroups.com
"I also think that a 5b write operation would no subside to other ops as easily as you claim."

I meant 5m not b

ddorian

Jul 5, 2012, 7:14:06 AM
to mongod...@googlegroups.com
If I'm not mistaken:
When a new post is created:
1. Find this user's subscribers (the users that have subscribed to this user).
2. Save a job in a rabbitmq/mongo collection (the job can't be lost, I think).
3. Read the job offline and do an "in" query to "push" the id of the post that was created (multi update).
4. I think it would be better for "find this user's subscribers" to be on the offline part, so no space is required to save the ids in the job queue.

When viewing your wall:
1. Get the user's wall, slice the top 20.
2. Make an "in" query and sort client side.

What happens when someone deletes a post? Do you go to every wall and "pull" the id, or just leave it and handle it client side when you are making the "in" query?
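
Step 3 would roughly be a multi update (using the shell's positional upsert/multi flags; names are just illustrative):

db.walls.update(
   { user_id: { $in: subscriberIds } },
   { $push: { posts: { id: newPostId } } },
   false,   // upsert
   true     // multi
);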

Eric Mathiasson

Jul 5, 2012, 7:30:50 AM
to mongod...@googlegroups.com
Hey!
Woah, what a long thread this is now, BIG THANK YOU for all your replies!

I think I've decided to go with the "pre-aggregate" method anyway; that makes more sense to me, for it to be a little bit delayed on write, but instant on read (which it HAS to be since it's in real time).
I'll be using memcached as well to store them in a cache - and therefore read them even faster - and use mongoDB as the persistent storage.
I haven't played around with mongo enough yet to benchmark it against my needs; maybe it's fast enough already. We'll see!

Don't know yet if I will implement the lazy aggregation, I'll see; it's definitely a good way to reduce some load on the servers.
The hybrid system sounds tasty as well; I need to make sure I can read the newest posts from all subscribables fast enough though, so loading the main wall doesn't freeze.


Once again, you've been a great help, all of you!
I think I finally got the picture of how to go about it.

XOXO

Sam Millman

Jul 5, 2012, 7:30:57 AM
to mongod...@googlegroups.com
Depends upon which method you are going for

dorian i

Jul 5, 2012, 7:41:58 AM
to mongod...@googlegroups.com
@Sam
I was talking about the Aggregate method.

Also it looks like he needs to pre-allocate the document since it will be ever-increasing (I think 16mb could get you around 1,000,000 ObjectIds).

Also it looks like (I looked very little at this) you could use Cassandra for the wall collection (http://stackoverflow.com/questions/9741522/cassandra-performance-for-long-rows), because Cassandra has better write scalability and hash-based sharding (to be in mongodb soon). But then we are in a mongodb forum, and that would be another database to maintain.

Sam Millman

Jul 5, 2012, 7:54:25 AM
to mongod...@googlegroups.com
"Also looks like he needs to pre-allocate the document since it will be ever increasing.(i think like 16mb could get you 1,000,000 objectids)"

Some people get speed problems querying over 200 ObjectIds in a single subdocument, so I'm not sure how you expect to query over 1m... I am also not sure about the 1m figure, since you gotta think about the padding factor and the size of the object (ObjectId()).

Also as a note: you cannot range query and extract the subdocument server side, which means you would have to bring those 1m elements back and filter them yourself, unless of course you wish to try and deal with the speed problems of a skip() limit() type $slice.

If you were to house the objectIds in batches within the document that would be better and might work.

Sam Millman

Jul 5, 2012, 7:55:08 AM
to mongod...@googlegroups.com
In fact, pre-paging the ObjectIds in the root document into groups of 20 would work, but then that constrains how you can query...
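
For what it's worth, that bucketed layout could look roughly like this (one document per "page" of the wall, names invented):

{
   _id: ObjectId(),
   wall_id: ObjectId(),   // whose wall this bucket belongs to
   page: 42,              // monotonically increasing bucket number
   posts: [ { id: ObjectId() } /* ... up to 20 of these ... */ ]
}

Reads would walk buckets backwards from the highest page number; inserts only ever touch the newest bucket.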

dorian i

Jul 5, 2012, 8:04:25 AM
to mongod...@googlegroups.com
I think you misunderstood me. Here is the wall document (it should be preallocated much larger so that when you do a push the document will not be moved, since you will only ever push to this):

{
   _id: whatever,
   posts: [
      { id: id },
      { id: id } // max 1 million of these
   ]
}
1. Get the latest posts using a $slice of 20 on the posts array of subdocuments (http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-RetrievingaSubrangeofArrayElements), as sketched below.
2. Using the ids from the slice, make an "in" query. Sort the documents returned from the "in" query by comparing them to what you got from the slice query.
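
In shell terms, steps 1 and 2 might look like this (illustrative names only):

// 1. Latest 20 ids from the wall document:
var wall = db.walls.findOne({ _id: wallId }, { posts: { $slice: -20 } });
// 2. Fetch the actual posts and sort client side against the order in wall.posts:
var ids = wall.posts.map(function (p) { return p.id; });
var posts = db.posts.find({ _id: { $in: ids } }).toArray();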

But: suppose you have a 10mb document and you only slice the top 20 (like 1kb?), does the whole document have to be in memory?


Sam Millman

Jul 5, 2012, 8:09:16 AM
to mongod...@googlegroups.com
Yes your speed problem comes right here:


posts: [
   { id: id },
   { id: id } // max 1 million of these
]

Say the user wanted, on fb, to look at posts from 2009 (which they can)? Not only that, but to get more pages, say more than 20... (I mean come on, even I look at more than the 20 latest posts on fb) you would need a skip() limit() type $slice.


"But. Suppose you have a 10mb document but you only slice the top 20, (like 1kb?) does all the document have to be in memory?"

Yes, I think.

Sam Millman

Jul 5, 2012, 8:12:54 AM
to mongod...@googlegroups.com
Not to mention updating that subdocument would be nasty; as I said, people have found problems after 200 ids in a subdocument, so I'm not sure how you expect to manage 1m of them...

dorian i

Jul 5, 2012, 8:24:56 AM
to mongod...@googlegroups.com
So people have found problems when pushing after 200? I see. I wonder how that works internally, is the whole array moved (possibly)?

What do you think about the Cassandra link which I posted earlier? (Looks like they have a 2 billion column max.)

Sam Millman

Jul 5, 2012, 8:28:03 AM
to mongod...@googlegroups.com
I doubt cassandra can go that far in reality.

I have used cassandra and it isn't much more scalable than MongoDB.

Everyone talks about theory but reality is always different; in theory Mongo can handle 1m ObjectIds in that subdocument, but I know from reading this user group that in reality people have found problems after 200 rows in a subdocument.

Theory is never reality and you should never code for theory.

Sam Millman

Jul 5, 2012, 8:32:18 AM
to mongod...@googlegroups.com
As noted from the SO link:

"It's supposed to be able to support up to 2 billion columns per row, but at this rate of increase performance will mean that it can't be used for very long rows in a practical situation."

Basically admitting that in reality, due to performance, the theoretical limit and the practical limit are very different.

And in the answer he also notes differences in speed.

The size isn't the problem, it's the speed.

Saar Korren

Jul 5, 2012, 10:57:31 AM
to mongod...@googlegroups.com
"You would just use an index to only get 20 of the latest via a set of subscriber ids. It is no different than querying by one user_id because you would have the same number of docs in the final result fair enough nscanned could be high but there are ways to solve that."
Querying by a set that is 5000-items large is like making 5000 queries. It is not like a single query. And again, this is assuming a Facebook-like limitation of 5000 on the number of subscriptions per user.

On the topic of locks, remember that if you insert 5m rows, MongoDB does not hold the lock for 5m inserts, it holds and releases the lock for every insert, allowing reads to occur at the same time. Overall, it is a much less expensive operation this way.

Regarding missing posts: They can be removed during iterations, although that requires determining the limit of the cursor after iteration begins. I'm not entirely sure if that achieves performance comparable to a preset limit.

Sam Millman

Jul 5, 2012, 11:27:06 AM
to mongod...@googlegroups.com
"On the topic of locks, remember that if you insert 5m rows, MongoDB does not hold the lock for 5m inserts, it holds and releases the lock for every insert, allowing reads to occur at the same time. Overall, it is a much less expensive operation this way."

My idea of it is one big collection, all sorted on user_id, using an index to grab all posts by a subset of 20 users, limited to 20; I am failing to understand where these 5k queries come from?


Sam Millman

Jul 5, 2012, 11:36:38 AM
to mongod...@googlegroups.com
"On the topic of locks, remember that if you insert 5m rows, MongoDB does not hold the lock for 5m inserts, it holds and releases the lock for every insert, allowing reads to occur at the same time. Overall, it is a much less expensive operation this way."

Yes, but you're throwing that many inserts in such a small period of time (even in an hour). I know my cluster wouldn't be happy doing that even twice (for 2 users) simultaneously, and I have pretty much the standard setup on AWS.

Also, I imagine a user having 5000 subscribers but not 5000 subscriptions; I cannot think of any fb or twitter user with that many subscriptions.

Saar Korren

Jul 8, 2012, 4:44:51 AM
to mongod...@googlegroups.com
Your search is not by a subset of 20 users, but by a subset of 5000 users. 5000 subscriptions is not "out there". Half the people in the office where I work have about that much.

Every friend you have on Facebook is a subscription (Their activities appear on your news feed), and Facebook intentionally limit your friends list to 5000 friends. It is VERY common for people to hit that limit, or at LEAST reach the 200-500 range. I don't know where you got the assumption that people would only have 20 subscriptions. That's ridiculous. That number is more common from people who don't even like the service, and use it VERY reluctantly.
Heck, even I have more than that on my Facebook, and I visit it once in a blue moon.

Now re-envision your prior query, executed, by your own statement, once every 3 minutes, with a subset of 5000 users, for every user. Still think that's particularly efficient?

Now let's at least put the math on the table:
Suppose you have u users
Each user has s subscriptions
Each subscription makes p posts per day
Each user refreshes their news page r refreshes per day
The news page shows n posts per page

You can either have u*s*n*r scanned rows per day (which require instant response), or u*s*p inserts per day (which can be serialized using a worker queue) plus u*r*n scanned rows (just to be fair).
To the latter, add that there's a collection with a row for every such insert, as an added space cost (not performance).

Now, I think that if you pick any realistic numbers, you will find that n*r is greater than p by orders of magnitude, and likewise s*n*r is that much greater than s*p + n*r. Even if you say that an insert (even a sequential one) is more expensive than a read (even an asynchronous one), when it comes to this big a difference in quantity, it's no contest.
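
To make that concrete with some purely illustrative numbers: say u = 1,000,000, s = 200, p = 5, n = 20 and r = 50. On-demand that is u*s*n*r = 2*10^11 scanned rows per day, all of them on the read path; pre-aggregated it is u*s*p = 10^9 inserts per day (which the worker queue can spread out) plus u*r*n = 10^9 scanned rows, roughly a hundred times less total work.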

It all comes down to CPU vs disk space. Either pay the space by making the extra collection, or drive your CPU load through the roof with the extra queries.