My experience hitting limits on Meteor performance


Andrew Mao

Aug 6, 2014, 11:05:49 PM
to meteo...@googlegroups.com
Today I had the misfortune of watching a Meteor server slowly grind to a halt under the weight of an essentially self-inflicted DDoS attack.

I have a very data-intensive collaborative realtime app (https://github.com/mizzao/CrowdMapper) and we tried connecting 120 users to it at once. After about 80 users, the CPU usage reached 100% - and stayed there. A death spiral ensued: users mashed buttons and called client-side methods, the server couldn't handle the requests quickly enough, and the resulting unresponsiveness prompted more button-mashing and a huge backlog of queued client operations. Eventually we just felt sorry for the server and killed it, and resorted to customer service/damage control with our clients instead.

Yes, the ideal solution to this would be a multiple-server cluster, etc. But several of the packages I'm using (user-status, partitioner, sharejs) won't work properly in that setting, and it would take additional development time to make everything work as a distributed system. We just wanted to kick the tires a bit, and they crumpled. From the client's perspective, the app just looks unresponsive - methods execute locally but get backlogged, publications don't update, and on reload the initial bundle loads and then just sits there because no data is received from the server.

This also speaks to the need for some built-in rate limiting, as discussed in https://groups.google.com/forum/#!topic/meteor-talk/XyYhi8ZMgd8 - not just for method calls but maybe for data publications too. Otherwise, a Meteor server trying to respond to many clients will quickly get backlogged, queue even more operations, and never recover.
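That kind of rate limiting could look like a token bucket per connection. Here's a minimal sketch in plain JavaScript; `TokenBucket` and `allow` are illustrative names, not an actual Meteor API:

```javascript
// Sketch of per-connection rate limiting, as might be applied to method
// calls before they reach the server's work queue. All names here are
// illustrative -- this is not a real Meteor API.
function TokenBucket(capacity, refillPerSecond) {
  this.capacity = capacity;
  this.tokens = capacity;
  this.refillPerSecond = refillPerSecond;
  this.lastRefill = Date.now();
}

TokenBucket.prototype.allow = function () {
  // Refill tokens based on elapsed time, capped at capacity.
  var now = Date.now();
  var elapsed = (now - this.lastRefill) / 1000;
  this.tokens = Math.min(this.capacity,
    this.tokens + elapsed * this.refillPerSecond);
  this.lastRefill = now;
  if (this.tokens >= 1) {
    this.tokens -= 1;
    return true;  // accept the method call
  }
  return false;   // reject: this client is over its budget
};
```

A server could keep one bucket per DDP session and reject calls once the bucket is empty, so button-mashing clients shed load instead of piling a backlog onto everyone else.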

Interested to hear if anyone else has run into these scenarios and their experience as well.

Arunoda Susiripala

Aug 6, 2014, 11:19:03 PM
to meteo...@googlegroups.com
I'm not quite sure of the exact reason, but I heard you were sending 200k documents to the client.

If that's the case, this result doesn't surprise me; Meteor isn't built for that kind of use case. If you are sending that many documents to the client, I hope you can optimize your app not to.

DDP is a realtime protocol; the merge box and everything around it was created to make that happen. If you don't use Meteor's realtime features, consider skipping DDP and using REST endpoints instead. That might save your app.

I agree that Meteor needs some kind of rate-limiting features and improvements, but I don't think your app is a perfect use case for Meteor.

--
You received this message because you are subscribed to the Google Groups "meteor-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to meteor-talk...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Andrew Mao

Aug 6, 2014, 11:49:26 PM
to meteo...@googlegroups.com
I don't know where you heard that I was sending 200k documents to the client. You may have been thinking about this post (http://stackoverflow.com/a/21835534/586086), but that was a different app and the data was static there so we did not send it over DDP.

If you try to send 200k reasonably large documents over DDP, adding them to the merge box will bring the server CPU to 100% for several seconds even for ONE client. So there's no way that would scale.

At most, my app sends 1.5k documents to the client. But it didn't even get that far: people were doing a tutorial before the actual app itself, which involved only 5 documents and a bunch of other method calls. The simultaneous traffic alone was too much. And yes, the documents have to be synced - they are not static.

Glasser mentioned that the merge box could be removed altogether on the server with a revamped publish API, which would help a lot with CPU usage. Node's fast I/O doesn't really help Meteor because so much of the code is CPU-bound JavaScript.

We've tried this app with 24 simultaneous users and it worked fine; tomorrow we're going to try with 32 and under slightly different conditions.

Arunoda Susiripala

Aug 7, 2014, 12:07:21 AM
to meteo...@googlegroups.com, David Glasser
@Andrew

Okay. Have you done a DTrace analysis of your app? Post a flamegraph here so we can really discuss the issue.

First of all: avoid Node 0.10.30 if you are using it. It has a weird CPU issue.

Meteor Up supports Meteor deployment on Solaris. So create a SmartOS box with 2 cores, run the load test, generate a flamegraph, and post it here.

@Glasser

What about the new publish API? Any plans for it? If so, when could we have it (1.0/1.1/2.0)?




Abigail Watson

Aug 7, 2014, 12:29:43 AM
to meteo...@googlegroups.com
Does TurkServer.log write to disk, or to a collection?  If it's writing to a log file on disk, you might be blowing up your application with those TurkServer.log calls.  Console.log generally takes what?  5 to 10ms of access time?  Negligible compared to a 200ms network call, or a 100ms database call.  But when you start adding hundreds of users, those ms can add up.  

Andrew Mao

Aug 7, 2014, 1:25:42 AM
to meteo...@googlegroups.com, gla...@meteor.com
How do I run Meteor 0.8.3 without Node 0.10.30? By the way, the app runs fine under other load conditions with a small number of users; I don't think there are any non-integer setTimeouts in it.

Trying to profile the app under load is hard in my case, because it runs in batches with users we recruit from Amazon Mechanical Turk. I write publications pretty efficiently and don't do anything stupid, so profiling tools pose a bigger risk of introducing bugs or hurting performance than the benefit they'd provide. I might try once I code up load-testing bots after this project is done; then I can try Kadira.

Abigail: TurkServer.log writes to an indexed collection. MongoDB is not stressed at all, and no publications read the log, so Meteor should not be tracking it while the app is running. Logging has worked fine with around 60 simultaneous users.

Arunoda Susiripala

Aug 7, 2014, 1:40:51 AM
to meteo...@googlegroups.com
Hi,

0.8.3 uses Node 0.10.29, not 0.10.30.
BTW, DTrace is not a traditional profiling tool; you don't need to add anything to your app. It's baked into Node.

You may not use timeouts yourself, but Meteor's retry package does, and it can produce non-integer timeouts. So don't use 0.10.30 with Meteor.

Andrew Mao

Aug 7, 2014, 4:13:05 PM
to meteo...@googlegroups.com, David Glasser
Okay, today's run narrowed down the problem.

The CPU spike is caused by repeated `Collection.insert` operations when a lot of people are connected to the server. It doesn't seem to matter whether they are actually subscribed to the collection, because the oplog watchers have to process the inserts anyway.

I noticed Graeme Pyle talking about doing such inserts outside of Meteor to speed things up: https://groups.google.com/forum/#!topic/meteor-talk/7GmsRHAb1EU. I'm going to try that, and I think it should fix the problem.

The length of the CPU spike seems to be related to the number of records being inserted *times* the number of people connected. So it can get pretty bad in the scenario I originally described.

@Glasser - does this explanation seem to make sense, or can you shed more light on it?

Andrew Mao

Aug 7, 2014, 4:13:49 PM
to meteo...@googlegroups.com, gla...@meteor.com
After the initial inserts finished, CPU never went above 10% with 32 people connected and active. So Meteor scalability is not the issue, whew! Just an idiosyncratic bug.

Andrew Mao

Aug 8, 2014, 1:23:33 AM
to meteo...@googlegroups.com, gla...@meteor.com
I've been searching around, and it seems that bulk inserts in Meteor can be a performance issue in how they interact with many oplog observers or connections. I'm not sure which, but it would be good to figure out the core issue.

References:


Going to try a bulk insert with the native Mongo API and see if it helps.

Slava Kim

Aug 8, 2014, 3:28:57 AM
to meteo...@googlegroups.com, gla...@meteor.com
FWIW, if you have a collection with a lot of inserts, it still creates big pressure on the oplog even if Meteor is not observing it (Meteor has to filter the noise out). I can't say whether that takes much CPU; this is just a theory. Something like an oplog proxy would help.

Arunoda Susiripala

Aug 8, 2014, 3:40:14 AM
to meteo...@googlegroups.com
There is another option, which is simple:
we should let developers specify which collections they need to watch via the oplog.

We had some collections that were only written to and never read, but all of their writes still came to Meteor via the oplog.
Because of that, we had to move that collection into a separate DB.

Previous discussion on this: http://goo.gl/GyJ1U2

Andrew Mao

unread,
Aug 8, 2014, 10:41:02 AM8/8/14
to meteo...@googlegroups.com
I'm getting a pretty clear picture of what happened now.

I was using https://github.com/mizzao/meteor-partitioner, which indexes a collection and divides it up by many "groups". Queries for each group should be fast because of the index.

I had about 100 people connected, each of them observing Collection.find({group: someRandomKey}), which matches about 5 documents per user. Then I created a new group and tried to add 1,500 documents to it. Result: each of the 1,500 inserts passed through the oplog observers of the 100 other users, even though the inserts did not affect them and a direct database query would have been very quick. That meant processing 150k operations each time, which easily brought the CPU to its knees, especially since I did this a few times in quick succession.

1,500 inserts is not a lot, and I can't imagine it creating a lot of "oplog pressure" if the collection isn't being observed. But this seems like a common situation that would be a point of weakness for Meteor.

Imagine if you have a Posts collection on your site and hundreds of people are subscribed to posts as they are reading them on your site. Suddenly you insert 1,000 posts for some reason. Now your server is basically going to DDoS itself by trying to update all the observers.

Another way to look at this is that operations on a Meteor server take O(n^2) CPU time to process, where n is the number of connections to the server. An insert with 1 person connected takes 1 operation; an insert with 100 people connected takes 100 times as long. 100 inserts with 100 people connected take 10,000 times as long as the single operation. Even spreading this across a cluster of servers will not help without some sort of oplog filtering.
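The arithmetic above can be made concrete. Assuming, as described, that every insert is re-processed by every connected client's observer, the total observer-side work is just inserts times observers:

```javascript
// Total observer-side operations when every oplog entry is examined by
// every connected client's observer (the worst case described above).
function observerWork(nInserts, nObservers) {
  return nInserts * nObservers;
}

// The scenario from this thread: 1,500 inserts with 100 connected users
// means 150,000 operations per burst.
var ops = observerWork(1500, 100);

// And the O(n^2) point: 100 inserts with 100 connections is 10,000x the
// work of 1 insert with 1 connection.
var ratio = observerWork(100, 100) / observerWork(1, 1);
```

This is only a cost model, not Meteor code, but it matches the 150k figure observed in practice.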

Slava - I know you are the right person to think about this problem and how we might mitigate it :)

Slava Kim

Aug 8, 2014, 6:29:08 PM
to meteo...@googlegroups.com
Andrew,

As I mentioned above, there was a thought of having an oplog proxy layer that would filter out unnecessary or irrelevant oplog records. There could also be improvements to the oplog driver in Meteor; as far as I understand, each distinct observe has a separate driver with a separate oplog tailer. I don't know that part of the driver really well, so I am not sure how hard it would be or what priority it has right now prior to 1.0 (if any).

The difficulty with an oplog proxy, for example, would be the inability to predict which collections the app will use. At the very least, it should be able to cut all the records not belonging to the app at the database level (records of all actions across all databases in the cluster go through the same oplog, afaik).

Personally, I don't think any time prior to 1.0 is a good time to mitigate these, but I haven't checked on that with Glasser, Nick, or anyone else.


--
You received this message because you are subscribed to a topic in the Google Groups "meteor-talk" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/meteor-talk/Y547Hh2z39Y/unsubscribe.
To unsubscribe from this group and all its topics, send an email to meteor-talk...@googlegroups.com.

Andrew Mao

Aug 10, 2014, 2:32:23 PM
to meteo...@googlegroups.com
Agreed - a smart oplog filter, perhaps even below the JS level, would cut CPU usage significantly.

Currently we have the problem that if each user spawns O(m) oplog watchers and there are n users triggering O(n) database actions, then the Meteor server has to do O(mn) CPU operations *per* database operation, or O(mn^2) CPU per unit time if we assume each user hits the database at the same rate. That will quickly lead to scaling problems.

This is better if some observers are multiplexed, but each user will inevitably have separate subscriptions in most cases.

Dave Workman

Sep 3, 2014, 1:01:15 PM
to meteo...@googlegroups.com
The same problem was brought to my attention yesterday, only with fewer users and more inserts/updates/deletes. With 5 users and 10k operations, my Meteor process hits 100% CPU and stays there for at least a few minutes on my development machine. On our production machine it locks up with as little as one user and 3,000 operations, and for quite a while longer (I haven't waited for it to finish; I've just been restarting Meteor). In our system it's unavoidable that some operations affect thousands of documents, so for us this is a big deal. For the moment, I've modified a method that could affect thousands of documents so that it does the update in batches of 250, waiting a quarter of a second between batches, and the site stays responsive. That's one of about 5 places where I'll need a similar approach.
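The batching workaround described above can be sketched roughly as follows; `applyUpdate`, the batch size, and the delay are placeholders for whatever the real method does:

```javascript
// Split a large set of document ids into fixed-size batches.
function chunk(items, size) {
  var batches = [];
  for (var i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Apply an update batch by batch, pausing between batches so the oplog
// observers can keep up. `applyUpdate` is a placeholder for the app's
// real update logic, e.g. Collection.update({_id: {$in: batch}}, ...).
function updateInBatches(ids, applyUpdate, batchSize, delayMs) {
  var batches = chunk(ids, batchSize);
  batches.forEach(function (batch, i) {
    setTimeout(function () {
      applyUpdate(batch);
    }, i * delayMs);
  });
}
```

With batches of 250 and a 250ms delay, 10k documents spread their oplog load over about 10 seconds instead of arriving all at once.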

One (potentially temporary) solution: if the merge box or oplog driver notices more than X (around 1,000?) documents ahead of it in the oplog, stop trying to do the comparisons and do a complete unpublish and republish of all the user's data.

I'd really love some sort of solution for 1.0!

Dave Workman

Sep 22, 2014, 2:56:28 PM
to meteo...@googlegroups.com
I've resorted to adding the disable-oplog package to my application until this gets addressed. :(

Arunoda Susiripala

Sep 22, 2014, 5:52:22 PM
to meteo...@googlegroups.com
Yes, you have a valid point.
Here's an old discussion and the suggested solution: https://groups.google.com/forum/#!topic/meteor-core/RpTxiGPUhMw

We had a similar issue with Kadira, and now we use two databases:
One for the app, which is used for realtime updates via the oplog.
One for metrics, where we don't need realtime updates via the oplog.

That's how we fixed this issue, but we're still looking for a native solution.


Dave Workman

Sep 22, 2014, 6:14:10 PM
to meteo...@googlegroups.com

I'm not sure the linked article contains a solution that would help me. The collection in question still needs to be realtime. There are administrative tasks that modify thousands of documents, but doing so causes the oplog tailer to do an enormous amount of work.

What would be more realistic is to push the oplog timestamp forward and do a poll and diff on the current subscriptions.
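Poll-and-diff in this sense means re-running the query and diffing the result against the last published snapshot, rather than replaying every oplog entry. A minimal sketch of the diff step (snapshots keyed by `_id`, documents compared by JSON serialization; this is an illustration, not Meteor's actual implementation):

```javascript
// Diff two query snapshots keyed by _id, producing the added/changed/
// removed sets a publication would need to send to the client. Documents
// are compared by JSON serialization for simplicity.
function diffSnapshots(oldDocs, newDocs) {
  var result = { added: [], changed: [], removed: [] };
  Object.keys(newDocs).forEach(function (id) {
    if (!(id in oldDocs)) {
      result.added.push(id);
    } else if (JSON.stringify(oldDocs[id]) !== JSON.stringify(newDocs[id])) {
      result.changed.push(id);
    }
  });
  Object.keys(oldDocs).forEach(function (id) {
    if (!(id in newDocs)) {
      result.removed.push(id);
    }
  });
  return result;
}
```

The appeal is that the cost of one poll-and-diff is proportional to the result-set size, not to the number of oplog entries skipped over.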


Dave Workman

Sep 22, 2014, 6:15:59 PM
to meteo...@googlegroups.com

I hate that there's no edit on here:

In case it's not obvious, I meant to push the oplog tailer timestamp forward and do a poll and diff if the oplog tailer notices a large number of oplog records ahead.
