SIGKILL - more details on when node dies?


Tim Heckel

Jan 21, 2014, 12:20:17 PM
to meteo...@googlegroups.com
I'm seeing these show up in our forever logs, and am hopeful I can get some more detail on when this happens -- I'm not sure what's causing the underlying node process to die. Is there any additional logging that Meteor surfaces that may help shed some light on this?

data:    /var/www/XXX/bundle/main.js:13190 - error: Forever detected script was killed by signal: SIGKILL

data:    /var/www/XXX/bundle/main.js:13190 - error: Forever restarting script for 15 time

Nick Martin

Jan 21, 2014, 9:03:39 PM
to meteo...@googlegroups.com
Hrm... sounds like something external to meteor is killing your process. SIGKILL is not a catchable signal, so I don't think you'll be able to print anything from node.
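For what it's worth, here's a minimal shell sketch of the difference (nothing Meteor-specific, just signal behavior):

    # TERM can be trapped and logged; KILL cannot -- the kernel never delivers it to the process
    trap 'echo "caught SIGTERM, shutting down" >&2' TERM
    kill -TERM $$   # the trap handler runs and the shell keeps going
    kill -KILL $$   # no handler runs, nothing gets logged, the process is simply gone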



Tim Heckel

Jan 22, 2014, 11:37:28 AM
to meteo...@googlegroups.com
Aha... thanks, Nick. Is there anything inside Meteor that would cause it to kill itself? Morbid. 

The nodejitsu guys don't know of any way to surface WHAT kills the script at this point. After moving to pure oplog tailing there's tons of capacity on the web servers, and I don't see a correlation between node dying and CPU utilization. Is there any way that a constrained ulimit (the default of 1024) could cause the OS to kill node? There does seem to be a positive correlation between the number of users on a given server and SIGKILL firing...
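In case it's useful, here's roughly how I'm checking the descriptor counts against the limit (the pgrep pattern is just a guess at what matches our bundle):

    # grab one of the running node processes for the bundle (adjust the pattern to your deployment)
    PID=$(pgrep -f 'bundle/main.js' | head -1)
    # the open-files limit the kernel actually applies to that process
    grep 'open files' /proc/$PID/limits
    # how many descriptors it currently has open
    ls /proc/$PID/fd | wc -l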

Tim Heckel

Jan 22, 2014, 11:52:04 AM
to meteo...@googlegroups.com
Ah, never mind my ulimit theory. I have one server with a high ulimit (65536) and the forever logs still show the same SIGKILL problem.

Tim Heckel

Jan 22, 2014, 3:03:45 PM
to meteo...@googlegroups.com
Found it -- I saw these in the log:

node invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

Then I ran free -h to check memory, and there was only about 100 MB left. We were running on a c1.xlarge with 8 node processes and 7 GB of RAM. It looks like there is a hard limit of 1 GB per node process.
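(For anyone hitting the same thing, free -h plus a rough sum of node's resident memory tells the whole story:)

    free -h                          # showed only ~100 MB free at the time
    # rough total RSS across all node processes, in MB
    ps -C node -o rss= | awk '{sum += $1} END {printf "%.0f MB\n", sum/1024}'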

Prior to oplog tailing, we had tons of free memory. Now that the web servers are keeping pace, Meteor is caching a ton more queries and the memory issue is surfacing. Before oplog, it was always a CPU concern.

We could have opted to shut down a couple of the node processes to preserve more memory for the others, or launch a newer instance type with more memory. For now we've put up m1.xlarges with 15 GB of memory running 4 node processes each, and have offloaded the c1.xlarges. These new servers sit at between 3 and 4 GB used, with plenty of free memory.

Because we do send a fair amount of data down to the client, I'm wondering what you think about us tweaking the default 1 GB limit per the above link? Do you have any thoughts on that? I'd like to use as much of the 15 GB as possible for caching. If that's not feasible (or helpful), then we should take another look at the CPU/memory options on Amazon and find a better fit.
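(If the ceiling does turn out to be V8's heap cap rather than the OS, my understanding is it can be raised with node's --max-old-space-size flag, in megabytes -- untested on our setup, and I'd have to double-check how to thread it through forever:)

    # raise V8's old-space heap cap for one process (value is in MB)
    node --max-old-space-size=4096 bundle/main.js
    # under forever, the command can apparently be swapped out with -c (worth verifying against forever's docs)
    forever start -c 'node --max-old-space-size=4096' bundle/main.js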

I'm glad we found the culprit -- it's funny how these pain points emerge as the framework improves! Keep it up :)

Nick Martin

Jan 23, 2014, 2:34:20 AM
to meteo...@googlegroups.com
Hi Tim,

Out of memory makes a lot of sense to explain the SIGKILLs.

I'm not sure about the 1 GB limit per node process. We routinely see node processes using over 1 GB of memory; the top one on meteor deploy right now is using 2.8 GB of RAM. Also, if this were an internal V8 limit, I wouldn't expect to see the Linux OOM killer messages. Can you confirm that it is the Linux OOM killer firing, not something in V8? I believe that if it shows up in the output of 'dmesg', it means the Linux kernel did the killing.
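Something along these lines should settle it (log paths vary by distro):

    # kernel OOM kills land in the kernel ring buffer and syslog
    dmesg | grep -i -E 'oom|killed process'
    grep -i 'killed process' /var/log/syslog /var/log/messages 2>/dev/null
    # a V8 out-of-memory abort, by contrast, writes to the process's own stderr --
    # something like "FATAL ERROR: ... Allocation failed - process out of memory" in the forever logs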

Are you using any swap space on your instances? I've found that Linux's OOM killer fires much less often if you add even a little swap. For example, we saw a marked improvement when we added just 1 GB of swap to our runner machines with 68 GB of RAM. Here's an article that talks about this a little: http://www.linuxjournal.com/article/10678. You don't need a whole swap partition, either; we just put a small swap file on /.
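Setting one up is only a few commands -- roughly this, with the 1 GB size mirroring the example above:

    # create and enable a 1 GB swap file on /
    sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # keep it across reboots
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab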

I do expect 0.7.0 to use more memory than previous versions, both with and without the oplog, because of some changes to how data is cached. However, not _that_ much more, around 50% or so. Are you running significantly fewer processes than before? As you run fewer processes, each one handles more sessions and uses more RAM.

Anything you can do to limit the data sent to each client will help. Our profiling shows that parsing the BSON data from Mongo is typically a hotspot in apps, so using the 'fields' specifier on queries or selecting fewer documents will save both CPU and RAM.
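If it helps to eyeball the effect, the same projection is easy to try from the mongo shell (database and collection names below are just placeholders):

    # fetch only the fields the client actually needs; everything else stays off the wire and out of the cache
    mongo meteor --eval 'printjson(db.items.find({}, {title: 1, updatedAt: 1}).limit(3).toArray())'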

Cheers,
-- Nick

Tim Heckel

Jan 24, 2014, 9:35:16 AM
to meteo...@googlegroups.com

Nick -- it looks like it was the kernel, but I guess I'm not 100% sure:

Jan 22 10:41:47 ip-10-31-196-68 kernel: [493895.113634] node invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

Jan 22 10:41:47 ip-10-31-196-68 kernel: [493895.113738]  [<ffffffff81106acd>] oom_kill_process.part.4.constprop.5+0x13d/0x280

Jan 22 10:44:12 ip-10-31-196-68 kernel: [494039.841765] nginx invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

Of course we were getting very strange errors -- sometimes a bad gateway when the OOM killer took out nginx, sometimes just a dead node process (that forever would restart). Running out of memory explained everything, and since moving to the bigger instances we haven't seen a single problem in the forever logs.

I think you're right about node going above 1 GB -- do you add any params to forever/node when you start it? Or does it just grow to the capacity it needs? I suppose I can do more research there. In our case, we had eight node processes sharing 7 GB of RAM. Add to that the fact that we load up the client initially with most of their data, and there is a healthy amount of data being cached.
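As a first pass at that research, I'll check what V8's default old-space cap actually is on our node build:

    # list V8's heap flags and their compiled-in defaults; max_old_space_size is the relevant one
    node --v8-options | grep -A1 -i 'old_space_size'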

Now we are running 4 processes sharing 15 GB of RAM. We haven't seen a spike in usage that would push any process over 1 GB yet, but I'm sure we will, possibly next week.
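In the meantime I'm just keeping an eye on per-process resident memory, something like:

    # largest node processes first; RSS is reported in KB
    watch -n 30 'ps -C node -o pid,rss,etime,args --sort=-rss | head -6'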

Thanks for the tips on the swap file and the 'fields' specifier -- we have done some of the field specifying, but have not looked into the swap file business at all. As far as CPU utilization goes, oplog tailing has made a dramatic difference -- our web servers are down from a consistent 60-80% utilization to between 5% and 15%.

Nick Martin

Jan 25, 2014, 1:53:20 AM
to meteo...@googlegroups.com
Yeah, that sure looks like the Linux OOM killer, and that's consistent with the weird behavior you were seeing. Linux gets really sad when it runs out of RAM.

We don't add any node parameters or custom settings when we run apps on meteor deploy servers. We do raise the ulimit for file descriptors on the machines, which is important for our proxy and for apps that handle more than 1024 simultaneous connections. We also add swap to the machines and set /proc/sys/vm/swappiness to 5, which helps keep the OOM killer from firing while still discouraging Linux from preemptively swapping things out.
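Concretely, that amounts to something like the following on each box (the 'appuser' name is a placeholder, and the exact config files vary by distro):

    # raise the file-descriptor limit for the app user (takes effect on next login / service restart)
    echo 'appuser  soft  nofile  65536' | sudo tee -a /etc/security/limits.conf
    echo 'appuser  hard  nofile  65536' | sudo tee -a /etc/security/limits.conf
    # keep swap available as an OOM safety valve, but discourage preemptive swapping
    sudo sysctl -w vm.swappiness=5
    echo 'vm.swappiness = 5' | sudo tee -a /etc/sysctl.conf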

Really glad to hear oplog tailing is working for you! It's great to see it actually being used and helping real people scale real sites =)

Cheers,
-- Nick