100% CPU usage with node, EMFILE, CLOSE_WAIT


Qasim Zaidi

May 28, 2013, 2:17:09 AM5/28/13
to nod...@googlegroups.com
Hi,

Ever since we switched to 0.10.x in production, we are seeing instances where the server gets into a loop and CPU usage shoots to 100%. Here is what I have been able to understand so far about this situation.

- Something (maybe a traffic spike) triggers a socket leak in the process, and we start getting a lot of descriptors in CLOSE_WAIT state. This is very similar to what is reported at https://github.com/einaros/ws/issues/180 (Connections stay in CLOSE_WAIT with Node 0.10.x).

- After some time, we run out of descriptors, i.e. the nofile limit is reached. This triggers the 100% CPU loop, and node becomes unresponsive.

- Here is the gdb stack trace for this process:

#0  0x00007f2ae2b57ee9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000006e063b in uv__accept4 (fd=fd@entry=15, addr=addr@entry=0x0, addrlen=addrlen@entry=0x0, flags=flags@entry=526336)
    at ../deps/uv/src/unix/linux-syscalls.c:225
#2  0x00000000006d310c in uv__accept (sockfd=sockfd@entry=15) at ../deps/uv/src/unix/core.c:394
#3  0x00000000006da73c in uv__emfile_trick (accept_fd=15, loop=<optimized out>) at ../deps/uv/src/unix/stream.c:447
#4  uv__server_io (loop=0xe4dc20 <default_loop_struct>, w=0x196d340, events=<optimized out>) at ../deps/uv/src/unix/stream.c:521
#5  0x00000000006dedfd in uv__io_poll (loop=loop@entry=0xe4dc20 <default_loop_struct>, timeout=0) at ../deps/uv/src/unix/linux-core.c:211
#6  0x00000000006d2dc8 in uv_run (loop=0xe4dc20 <default_loop_struct>, mode=<optimized out>) at ../deps/uv/src/unix/core.c:312
#7  0x0000000000595f10 in node::Start(int, char**) ()
#8  0x00007f2ae2a8976d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x000000000058bba5 in _start ()


It is continuously stuck between uv__server_io and uv__emfile_trick, opening and closing sockets. New connections are closed immediately, as described in the comments at the top of uv__emfile_trick, but it just won't get out of this loop, and the process as a whole is unresponsive. I am running with the cluster module, and accept backlog is set to 256.

I previously suspected this to be an issue with our own code, but it doesn't look like that any longer. The errno from uv__accept is EMFILE here:

    fd = uv__accept(uv__stream_fd(stream));

This triggers the call to uv__emfile_trick, inside which the loop keeps closing the connections until it gets an errno of EDEADLK.
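
For anyone following along, the idea behind that trick is roughly this (a simplified sketch only, not the actual libuv source; the names here are made up): reserve a spare descriptor at startup, and when accept() fails with EMFILE, close the spare, drain the backlog by accepting and immediately closing connections, then re-reserve the spare.

    /* Simplified illustration of the "EMFILE trick"; NOT the actual libuv code.
     * Assumes listen_fd is non-blocking, as it is in libuv. */
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int spare_fd = -1;

    void reserve_spare_fd(void) {
      /* Keep one descriptor in reserve so there is a free slot when EMFILE hits. */
      spare_fd = open("/dev/null", O_RDONLY);
    }

    /* Called after accept() on listen_fd failed with EMFILE/ENFILE. */
    void emfile_trick(int listen_fd) {
      if (spare_fd == -1)
        return;                 /* nothing we can do */

      close(spare_fd);          /* free one descriptor slot */

      for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd == -1) {
          if (errno == EINTR)
            continue;           /* interrupted, retry */
          break;                /* EAGAIN/EWOULDBLOCK: backlog drained */
        }
        close(fd);              /* immediately drop the pending connection */
      }

      spare_fd = open("/dev/null", O_RDONLY);   /* re-reserve the spare slot */
    }

The point of the spare descriptor is that without it, a level-triggered listen socket would stay readable forever while nothing could be accepted, and the event loop would spin.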

Any suggestions on how to proceed? I can go back to 0.8.x or increase the max open file limit for the time being to make this go away, but I would really like to understand the problem first.

Regards
Qasim


Bert Belder

May 28, 2013, 8:07:26 AM5/28/13
to nod...@googlegroups.com
On Tuesday, May 28, 2013 8:17:09 AM UTC+2, Qasim Zaidi wrote:
Ever since we switched to 0.10.x in production, we are seeing instances where the server gets into a loop and CPU usage shoots to 100%. Here is what I have been able to understand so far about this situation.

- After some time, we run out of descriptors, i.e. the nofile limit is reached. This triggers the 100% CPU loop, and node becomes unresponsive.

What node version is this? This was supposedly fixed in libuv-v0.10.6. To benefit from the fix you need to use node-v0.10.7 or later.

Something (maybe a traffic spike) triggers a socket leak in the process, and we start getting a lot of descriptors in CLOSE_WAIT state. This is very similar to what is reported at https://github.com/einaros/ws/issues/180 (Connections stay in CLOSE_WAIT with Node 0.10.x).

Do the CLOSE_WAIT connections show up before or after node hits 100% CPU usage? A socket in CLOSE_WAIT state means that the connection was completely wound down, but the file descriptor hasn't been closed. If a small fraction of sockets is in CLOSE_WAIT state there's nothing to worry about, node just hasn't gotten around to closing the file descriptor yet. It is also "expected" when the 100%-cpu issue kicks in since node basically hangs and won't be closing any FDs at all.
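
To make that concrete (a made-up minimal example, not node's code): at the syscall level a socket enters CLOSE_WAIT as soon as the peer's FIN arrives, and it stays there, still occupying a descriptor, until the application closes its fd.

    /* Illustration only: why descriptors linger in CLOSE_WAIT.
     * 'fd' is a connected socket whose peer has already closed its end. */
    #include <unistd.h>

    void handle_peer_close(int fd) {
      char buf[4096];
      ssize_t n = read(fd, buf, sizeof(buf));
      if (n == 0) {
        /* Peer sent FIN: the kernel has already moved this socket to
         * CLOSE_WAIT. It stays in CLOSE_WAIT, and keeps using an fd slot,
         * until we call close(). A process that is stuck spinning at 100%
         * CPU never gets here, so CLOSE_WAIT sockets pile up. */
        close(fd);
      }
    }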

- Bert

Qasim Zaidi

May 29, 2013, 1:21:56 PM5/29/13
to nod...@googlegroups.com
The node version was 0.10.3. Since then, I have set up 3 instances: one with node 0.10.8, another with the original 0.10.3, and the third with 0.8.17, to see if there is a recurrence.

Incidentally, I am clearly noticing that the instance with 0.8.17 has lower CPU usage than the other two. All 3 are behind load balancers, so it looks like 0.10.x or the modules we use have some performance regression.

Regards
Qasim


Qasim Zaidi

May 29, 2013, 1:25:43 PM5/29/13
to nod...@googlegroups.com
Also - on the CLOSE_WAIT issue - the number of CLOSE_WAIT sockets is the same as the nofile limit when this happens.

The process keeps closing the sockets after accepting them. That is what the comment says, and that's exactly the behaviour I have observed in gdb. I don't yet know what comes first, the CLOSE_WAIT increase or the 100% CPU usage.

Vitaly Puzrin

Jun 4, 2013, 7:47:43 PM6/4/13
to nod...@googlegroups.com
Thanks for the info. I had this problem in production and had to roll back to 0.8.

The symptoms were very similar - the process works for some time, and then starts eating 100% CPU. The only difference was the node-log2js module, which started using streams for writing files. The process has a 50K open-file limit.

Will check if 0.10.7+ helps.

On Tuesday, May 28, 2013, 16:07:26 UTC+4, Bert Belder wrote:

Vitaly Puzrin

Jun 5, 2013, 1:54:18 AM6/5/13
to nod...@googlegroups.com
Tried 0.10.10 on fontello.com. The symptoms are the same:

- works for several minutes
- starts using 50-100% CPU and leaking memory
- crashes when it reaches 1.5 GB RES

lsof -a -p <pid> doesn't show anything special (the descriptor count is the same)

After rolling back to 0.8, everything works without problems.

The modules are the same. The only difference I know of is that the `log2js-node` module uses readable-stream on 0.8 and native streams on 0.10.

Does anyone have an idea what to do? I'm fine with 0.8, but it will reach end of life sometime...

On Wednesday, June 5, 2013, 3:47:43 UTC+4, Vitaly Puzrin wrote:

Isaac Schlueter

Jun 5, 2013, 11:43:38 AM6/5/13
to nodejs
This sounds exactly like https://github.com/joyent/node/issues/5504.
We're very close to a fix, and it'd be good to go weigh in there so
that we can track it more easily than on a mailing list. Short
answer: it's an odd kernel behavior where it tells us EOF rather than
the actual error code, and Node handles it improperly.
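
Roughly, the distinction at the read() level looks like this (a simplified
illustration, not the actual node/libuv code involved in that issue):

    /* Simplified sketch of EOF vs. error on a non-blocking socket read. */
    #include <errno.h>
    #include <unistd.h>

    int read_step(int fd, char *buf, size_t len) {
      ssize_t n = read(fd, buf, len);
      if (n > 0)
        return 0;                   /* got data */
      if (n == 0)
        return 1;                   /* EOF: peer closed cleanly */
      if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;                   /* no data yet, wait for the next event */
      return -1;                    /* real error, e.g. ECONNRESET */
    }

If the kernel reports the EOF case where an error was expected, the
error-handling path never runs.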

But before you head over there, just to make sure that it's the same
issue, please run this file, and then paste the results into the issue
comments:

https://raw.github.com/isaacs/node/GH-5504/test/simple/test-net-GH-5504.js

You can either drop that in `test/simple/` in the node source code, or
just comment out the `var common = require('../common')` line.

Also, what is the output of `uname -a` on the affected systems?

Thanks!

Vitaly Puzrin

Jun 5, 2013, 8:28:45 PM6/5/13
to nod...@googlegroups.com, i...@izs.me
Thanks for the info. Added a comment to the ticket.

You are right, I'm on a 2.6.32 kernel. If the fix does not help, I'll upgrade my host system
(Proxmox PVE 2.3 -> PVE 3.0) and check if anything changes.

On Wednesday, June 5, 2013, 19:43:38 UTC+4, Isaac Schlueter wrote: