Tornado server using 100% CPU with no activity, socket connection stuck in CLOSED state


Debby Mendez

May 23, 2014, 3:57:41 PM5/23/14
to python-...@googlegroups.com
Running Tornado 3.2 on Python 3.4 on FreeBSD 8.3

Currently we have very little activity on our Tornado-based servers, but a couple of times now we have seen a problem where the server is pegged at 100% CPU with no traffic.  I've collected a bunch of stack traces (the process writes a trace when you send it SIGUSR1).  Many (but not all) of them look like this:

2014/05/22 14:38:45 SERIOUS ### Dumping stack traces on sigusr1 ###
2014/05/22 14:38:45 SERIOUS # Thread: MainThread(34374435264)
2014/05/22 14:38:45 SERIOUS File: "service.py", line 820, in <module>
2014/05/22 14:38:45 SERIOUS   tornado.ioloop.IOLoop.instance().start()
2014/05/22 14:38:45 SERIOUS File: "/usr/local/lib/python3.4/site-packages/tornado-3.2-py3.4-freebsd-8.3-RELEASE-p14-amd64.egg/tornado/ioloop.py", line 688, in start
2014/05/22 14:38:45 SERIOUS   self._handlers[fd](fd, events)
2014/05/22 14:38:45 SERIOUS File: "/usr/local/lib/python3.4/site-packages/tornado-3.2-py3.4-freebsd-8.3-RELEASE-p14-amd64.egg/tornado/stack_context.py", line 302, in wrapped
2014/05/22 14:38:45 SERIOUS   ret = fn(*args, **kwargs)
2014/05/22 14:38:45 SERIOUS File: "/usr/local/lib/python3.4/site-packages/tornado-3.2-py3.4-freebsd-8.3-RELEASE-p14-amd64.egg/tornado/iostream.py", line 316, in _handle_events
2014/05/22 14:38:45 SERIOUS   self._handle_read()
2014/05/22 14:38:45 SERIOUS File: "/usr/local/lib/python3.4/site-packages/tornado-3.2-py3.4-freebsd-8.3-RELEASE-p14-amd64.egg/tornado/iostream.py", line 895, in _handle_read
2014/05/22 14:38:45 SERIOUS   self._do_ssl_handshake()
2014/05/22 14:38:45 SERIOUS File: "/usr/local/lib/python3.4/site-packages/tornado-3.2-py3.4-freebsd-8.3-RELEASE-p14-amd64.egg/tornado/iostream.py", line 848, in _do_ssl_handshake
2014/05/22 14:38:45 SERIOUS   except socket.error as err:
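
For context, a minimal sketch of the kind of SIGUSR1 stack-dump handler that produces output like the trace above (the actual handler in service.py is not shown in the thread; the names and log calls here are illustrative):

# Hypothetical SIGUSR1 handler that dumps a stack trace for every thread.
import logging
import signal
import sys
import threading
import traceback

def dump_stacks(signum, frame):
    logging.error("### Dumping stack traces on sigusr1 ###")
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frm in sys._current_frames().items():
        logging.error("# Thread: %s(%d)", names.get(ident, "?"), ident)
        for line in traceback.format_stack(frm):
            logging.error(line.rstrip())

signal.signal(signal.SIGUSR1, dump_stacks)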

procstat -f shows that the process has one socket open.  netstat shows that the connection is in the CLOSED state.  There is no evidence of that connection in our log files, so I don't think it ever got very far; it may never have gotten past the handshake to begin with.
Could a connection be closed by the other side while Tornado keeps the socket open and somehow keeps getting read events on it?  I've looked but can't find any similar reports; just wondering if this has come up for anyone else.

Ben Darnell

May 24, 2014, 10:54:47 AM5/24/14
to Tornado Mailing List
Can you reproduce this or is it just something that happens from time to time?  When it happens does it ever recover on its own or does it stay this way until you restart the process?  I have a few theories, but I'm not sure how to test any fix. 

My best guess is that OpenSSL in non-blocking mode is not interpreting an empty read from the socket as EOF, and so it just keeps trying to read from the closed socket endlessly.  On Linux we get an IOLoop.ERROR flag from epoll, so IOStream will see it and close the connection, but on other platforms this doesn't happen.  I'm not sure where exactly the connection should be detected as closed in this case.
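
To make that suspected failure mode concrete, here is a minimal sketch (not Tornado's actual code) of a non-blocking handshake step; if the peer's close never surfaces in the last except clause, every READ event just produces another WANT_READ and the loop spins:

# Minimal sketch (not Tornado's code) of the suspected failure mode.
import socket
import ssl

def handshake_step(ssl_sock):
    """One attempt at a non-blocking TLS handshake; reports what happened."""
    try:
        ssl_sock.do_handshake()
        return "done"
    except ssl.SSLWantReadError:
        # Healthy case: wait for the poller to report the fd readable again.
        # Pathological case: the peer is gone but no error/EOF is ever
        # reported, so every READ event lands here and the IOLoop spins.
        return "want_read"
    except (ssl.SSLEOFError, ssl.SSLError, socket.error) as err:
        # Where a reset or half-closed connection *should* end up, so the
        # stream can be closed instead of retried.
        return "closed: %s" % err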

-Ben



Ben Darnell

May 24, 2014, 12:33:33 PM5/24/14
to Tornado Mailing List
What does this script print when you run it?

I get the following, which indicates that OpenSSL is correctly detecting that the socket was closed mid-handshake.

$ python3.4 ssltest.py 
EOF occurred in violation of protocol (_ssl.c:598)
EOF occurred in violation of protocol (_ssl.c:598)
EOF occurred in violation of protocol (_ssl.c:598)
[Errno 54] Connection reset by peer
None
iterations: 6
.
----------------------------------------------------------------------
Ran 1 test in 0.032s

OK
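
(The ssltest.py attachment itself isn't preserved in the archive.  Purely as a hypothetical stand-in, a script exercising one variant of the same scenario — a client that completes the TCP handshake and then closes before sending a TLS ClientHello — might look roughly like this; the cert/key filenames are assumptions:)

# Hypothetical stand-in for ssltest.py (not the original script): count how
# many times do_handshake() is attempted before OpenSSL reports the close.
import select
import socket
import ssl
import threading

CERTFILE = "test.crt"  # assumption: any self-signed cert/key pair on disk
KEYFILE = "test.key"

def main():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    # Client side: finish the TCP handshake, then close without speaking TLS.
    def client():
        socket.create_connection(("127.0.0.1", port)).close()
    t = threading.Thread(target=client)
    t.start()

    server_conn, _ = listener.accept()
    server_conn.setblocking(False)
    ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
    ctx.load_cert_chain(CERTFILE, KEYFILE)
    ssl_conn = ctx.wrap_socket(server_conn, server_side=True,
                               do_handshake_on_connect=False)

    iterations = 0
    error = None
    while iterations < 50:  # guard against spinning forever
        iterations += 1
        try:
            ssl_conn.do_handshake()
            break
        except ssl.SSLWantReadError:
            select.select([ssl_conn], [], [], 1.0)
        except (ssl.SSLError, OSError) as err:
            error = err
            break
    print(error)             # e.g. "EOF occurred in violation of protocol"
    print("iterations:", iterations)
    t.join()

if __name__ == "__main__":
    main()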

Debby Mendez

May 27, 2014, 9:18:19 AM5/27/14
to python-...@googlegroups.com, b...@bendarnell.com
I get the same thing on FreeBSD:

[debby@BSD-Builder ~/glance]$ python3.4 ssltest.py 
EOF occurred in violation of protocol (_ssl.c:598)
EOF occurred in violation of protocol (_ssl.c:598)
EOF occurred in violation of protocol (_ssl.c:598)
[Errno 54] Connection reset by peer
None
iterations: 6
.
----------------------------------------------------------------------
Ran 1 test in 0.042s

OK

Debby Mendez

May 27, 2014, 9:29:18 AM5/27/14
to python-...@googlegroups.com, b...@bendarnell.com
- Cannot reproduce it; it just happens from time to time, but only on our production servers, which suggests it is related to something that is trying to connect in some unexpected way.
- It never recovers.  We currently have a process that has been in this state for 4 days.

Could there be any issue with a client that opens a connection and then closes it without ever completing the handshake?

Debby Mendez

May 27, 2014, 4:17:03 PM5/27/14
to python-...@googlegroups.com
Is your thinking that in SSLIOStream._do_ssl_handshake(), the call to do_handshake() is throwing SSL_ERROR_WANT_READ when it really should be throwing a socket.error with _ERRNO_CONNRESET?

I tried pointing the client side of your test script at my server and tried some variations that send a garbage handshake, etc., in an attempt to reproduce it, but with no success.

Maybe at the least I can add some code to guard against this, perhaps by limiting the number of times do_handshake() could be called on a stream before closing it.

Ben Darnell

May 27, 2014, 5:14:28 PM5/27/14
to Tornado Mailing List
On Tue, May 27, 2014 at 4:17 PM, Debby Mendez <de...@josesmex.com> wrote:
Is your thinking that in SSLIOStream._do_ssl_handshake(), the call to do_handshake() is throwing SSL_ERROR_WANT_READ when it really should be throwing a socket.error with _ERRNO_CONNRESET?

Yes, or SSL_ERROR_EOF. 
 

I tried pointing the client side of your test script at my server and tried some variations that send garbage handshake, etc. in an attempt to reproduce it, but no success.

OK, it sounds like we've disproved this hypothesis.
 

Maybe at the least I can add some code to guard against this, perhaps by limiting the number of times do_handshake() could be called on a stream before closing it.

This sounds like a good next step.  After 10 calls to do_handshake, you could start doing some additional logging, and then close it after 20.  I would try logging the `events` argument to `_handle_events`, any exception raised by do_handshake, and how long do_handshake took.
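
A hedged sketch of what those diagnostics might look like as an SSLIOStream subclass (the counter and attribute names here are invented for illustration and are not part of Tornado's API; the same instrumentation can equally be patched directly into iostream.py):

# Hypothetical diagnostic wrapper around Tornado's SSLIOStream.
import logging
import time

from tornado.iostream import SSLIOStream

class DiagnosticSSLIOStream(SSLIOStream):
    def _handle_events(self, fd, events):
        self._last_events = events  # remembered for the log lines below
        super()._handle_events(fd, events)

    def _do_ssl_handshake(self):
        count = getattr(self, "_handshake_count", 0) + 1
        self._handshake_count = count
        start = time.time()
        try:
            super()._do_ssl_handshake()
        finally:
            if count > 10:
                fd = self.socket.fileno() if self.socket else -1
                logging.warning(
                    "ssl handshake attempt %d on fd %s: events=%r, took %.6fs",
                    count, fd, getattr(self, "_last_events", None),
                    time.time() - start)
            if count > 20:
                logging.warning("closing stream after %d handshake attempts",
                                count)
                self.close()

Logging the exception raised by do_handshake() itself would need a patch inside _do_ssl_handshake(), since it is caught there rather than propagating out.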

-Ben

Debby Mendez

Jul 24, 2014, 9:57:09 AM7/24/14
to python-...@googlegroups.com
OK, I patched Tornado 3.2 with additional diagnostics in iostream.py as you suggested and did get a couple of incidents simultaneously on 2 of our servers.  The connections are coming from something that is scanning servers/ports, not from our client.

The logging I added shows that the socket is getting READ events, but socket.do_handshake() raises socket.error 57, which is ENOTCONN ("socket is not connected") on FreeBSD:
2014/07/07 13:17:35 Got events [1] for socket 12 looping excessively in ssl handshake
2014/07/07 13:17:35 ssl handshake count 12 for socket 12 from (not connected)
2014/07/07 13:17:35 socket.error: 57 on (not connected)

Because errno 57 is not included in _ERRNO_CONNRESET, Tornado does not give up and close the socket.

Here is a tcpdump from one of the incidents, which I think shows the remote side resetting the connection right after the TCP handshake completes:
13:25:15.164062 IP 212.83.148.113.51730 > 64.95.76.27.443: Flags [S], seq 4127360044, win 32, options [mss 1460,nop,wscale 0,nop,nop,sackOK], length 0
13:25:15.164208 IP 64.95.76.27.443 > 212.83.148.113.51730: Flags [S.], seq 1489116201, ack 4127360045, win 65535, options [mss 1460,nop,wscale 3,sackOK,eol], length 0
13:25:15.270470 IP 212.83.148.113.51730 > 64.95.76.27.443: Flags [.], ack 1, win 32, length 0
13:25:15.270475 IP 212.83.148.113.51730 > 64.95.76.27.443: Flags [R.], seq 1, ack 1, win 0, length 0
13:25:15.273408 IP 212.83.148.113.61581 > 64.95.76.27.8443: Flags [S], seq 288987127, win 32, options [mss 1460,nop,wscale 0,nop,nop,sackOK], length 0
13:25:15.273414 IP 64.95.76.27.8443 > 212.83.148.113.61581: Flags [S.], seq 4057220914, ack 288987128, win 65535, options [mss 1460,nop,wscale 3,sackOK,eol], length 0
13:25:15.379878 IP 212.83.148.113.61581 > 64.95.76.27.8443: Flags [.], ack 1, win 32, length 0
13:25:15.380314 IP 212.83.148.113.61581 > 64.95.76.27.8443: Flags [R.], seq 1, ack 1, win 0, length 0

Could Tornado have closed the socket on its side but kept the stale file descriptor registered with the IOLoop?

Ben Darnell

Jul 24, 2014, 11:44:01 PM7/24/14
to Tornado Mailing List
Aha, I see the problem now.  The except block at https://github.com/tornadoweb/tornado/blob/v4.0.0/tornado/iostream.py#L1186 needs an "else: raise" at the end just like the preceding one does; currently errors other than CONNRESET are simply caught and silently discarded.  The socket is still partially open (or else you'd be getting EBADF errors from epoll), but the other side has closed and that is causing some kind of problem. The packet sequence you're seeing is characteristic of nmap in "-sT" mode, although when I run it against my own machine I can't reproduce the 100%-CPU spin (I tried with both the 3.2 and 4.0 branches).  
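
(Sketched as a standalone function, the shape of that fix looks roughly like this — a simplification for illustration, not an exact copy of Tornado's code or of the eventual patch:)

# Simplified sketch of the handshake error handling and the missing raise.
import errno
import socket
import ssl

# Approximation of Tornado's _ERRNO_CONNRESET tuple (see iostream.py for
# the exact contents in your version).
_ERRNO_CONNRESET = (errno.ECONNRESET, errno.ECONNABORTED, errno.EPIPE)

def do_ssl_handshake_step(stream):
    """One handshake attempt; `stream` stands in for the SSLIOStream and
    needs only a .socket (ssl-wrapped) and a .close() method."""
    try:
        stream.socket.do_handshake()
    except ssl.SSLError as err:
        if err.args[0] in (ssl.SSL_ERROR_WANT_READ, ssl.SSL_ERROR_WANT_WRITE):
            return  # not fatal: retry when the IOLoop reports the fd ready
        return stream.close()
    except socket.error as err:
        if err.args[0] in _ERRNO_CONNRESET:
            return stream.close()
        # The missing "else: raise": without it, errnos such as 57 (ENOTCONN
        # on FreeBSD) are swallowed and the handshake is retried on every
        # READ event, which is the 100%-CPU spin seen here.
        raise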


Thanks,
-Ben


Debby Mendez

Jul 25, 2014, 10:35:32 AM7/25/14
to python-...@googlegroups.com, b...@bendarnell.com
Thanks so much!
Debby