599 Timeouts with AsyncHTTPClient


Nikolas Stevenson-Molnar

Jun 16, 2014, 4:33:44 PM
to python-...@googlegroups.com
Hi everyone,

I've been getting intermittent 599 errors using AsyncHTTPClient. I've tried a number of solutions and have not yet been able to resolve the issue, so I thought I'd see if anyone here could offer me some clues.

The problem has been hard to get a handle on because it's inconsistent and so far only presents in our production environment (I've not been able to reproduce the problem on my local machine nor on our staging server). I have a web application which uses Django and Celery. I start a Celery task in response to some user action and in that task make a number of requests using AsyncHTTPClient. The exact number of requests varies, but typically no more than a few dozen. I also have a callback which checks for potential additional requests which need to be made.

Sometimes this all works fine; everything behaves exactly as expected. But other times, it seems like one of the HTTP requests will block for a long time (15, 20 minutes) and eventually it and all other requests in that batch will time out and return 599 statuses. Here are the things I've looked at and the fixes I've tried:

1) Raised the file limit on the server (thinking perhaps the client had to wait on available sockets). No effect.
2) Ensured my periodic callback was not blocking. I added logging about its execution time and confirmed that in the worst case it would take a little more than a second to execute, and in most cases finished in under a tenth of a second.
3) Used IP addresses for the requests, thinking maybe the resolution was timing out. No effect.
4) Added logging on the target server with timing information to confirm that the request is not legitimately timing out (all requests seem to be returning within a reasonable amount of time).
5) Switched to using the cURL-backed client. This has been the most promising so far, it's cut down on the rate of incidents, but they're still happening.
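The timing check in (2) can be done with a small wrapper around the callback before it is handed to PeriodicCallback. A minimal sketch (hypothetical helper; the name and threshold are illustrative):

```python
import logging
import time

def timed(callback, warn_after=0.1):
    """Wrap a callback and log a warning when one invocation runs longer
    than warn_after seconds (hypothetical helper; pass the wrapper to
    PeriodicCallback instead of the raw callback)."""
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return callback(*args, **kwargs)
        finally:
            elapsed = time.time() - start
            if elapsed > warn_after:
                logging.warning("callback %s took %.3fs",
                                getattr(callback, "__name__", "?"), elapsed)
    return wrapper
```

The wrapper is transparent: return values and exceptions pass straight through, so it can be dropped in without changing behavior.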

Based on my observations so far, my best guess is that on occasion, one of the requests is blocking indefinitely. I say this because all requests in the same process time out at the same time. I've searched for 599 errors and found several mentions on this list and on Stack Overflow, but so far none of those solutions have panned out (mostly the solution was to upgrade Tornado, but that was a while ago and I'm on a version past the ones which fixed other 599 problems).

Can anyone give me some advice on how to debug this problem further? It would be hugely helpful to know what exactly is blocking, but I'm not sure the best way to go about getting this information.

I'm running Tornado v3.1.

Thanks for your help!
_Nik

Ben Darnell

Jun 17, 2014, 11:55:00 PM
to Tornado Mailing List
On Mon, Jun 16, 2014 at 4:33 PM, Nikolas Stevenson-Molnar <nik.m...@consbio.org> wrote:
Hi everyone,

I've been getting intermittent 599 errors using AsyncHTTPClient. I've tried a number of solutions and have not yet been able to resolve the issue, so I thought I'd see if anyone here could offer me some clues.

The problem has been hard to get a handle on because it's inconsistent and so far only presents in our production environment (I've not been able to reproduce the problem on my local machine nor on our staging server). I have a web application which uses Django and Celery. I start a Celery task in response to some user action and in that task make a number of requests using AsyncHTTPClient. The exact number of requests varies, but typically no more than a few dozen. I also have a callback which checks for potential additional requests which need to be made.

Sometimes this all works fine; everything behaves exactly as expected. But other times, it seems like one of the HTTP requests will block for a long time (15, 20 minutes) and eventually it and all other requests in that batch will time out and return 599 statuses. Here are the things I've looked at and the fixes I've tried:

Do all the others return their 599s at the same time or are they spread out?  After this has happened, do things return to normal or do you continue to see a lot of 599s?  When you say the request is blocking for 15-20m, is the server still responding to other requests as long as they don't need to make outbound http calls?
 

1) Raised the file limit on the server (thinking perhaps the client had to wait on available sockets). No effect.

FYI, hitting the file descriptor limit is usually pretty noisy in the logs - it starts spamming "Too many open files" errors.
 
2) Ensured my periodic callback was not blocking. I added logging about its execution time and confirmed that in the worst case it would take a little more than a second to execute, and in most cases finished in under a tenth of a second.

You can use IOLoop.set_blocking_log_threshold to see if anything in the whole app is blocking.
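(In Tornado 3.x this is a call like IOLoop.instance().set_blocking_log_threshold(1), which works by arming a SIGALRM timer around each callback and logging a stack trace if it fires before the callback returns. A minimal stdlib sketch of the same mechanism, Unix-only and main-thread-only - an illustration, not Tornado's actual code:)

```python
import signal
import time

fired = []

def on_alarm(signum, frame):
    # Tornado's handler logs the current stack trace here, which is what
    # identifies the blocking callback.
    fired.append(signum)

signal.signal(signal.SIGALRM, on_alarm)
signal.setitimer(signal.ITIMER_REAL, 0.05)  # the "threshold": 50ms

deadline = time.time() + 0.2
while time.time() < deadline:  # stands in for a callback that blocks the loop
    pass

signal.setitimer(signal.ITIMER_REAL, 0)  # disarm once the callback returns
```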
 
3) Used IP addresses for the requests, thinking maybe the resolution was timing out. No effect.

DNS is a likely culprit in this scenario, since DNS resolution is blocking by default, so it's good to rule it out.
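(To rule DNS out more thoroughly, Tornado 3.1 can also be pointed at a non-blocking resolver, e.g. Resolver.configure('tornado.netutil.ThreadedResolver'). The idea is simply to move the blocking socket.getaddrinfo call off the IOLoop thread; a stdlib sketch of that idea:)

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# The default resolver calls getaddrinfo inline on the IOLoop thread; a
# threaded resolver submits it to a worker pool instead, roughly like this:
pool = ThreadPoolExecutor(max_workers=4)
future = pool.submit(socket.getaddrinfo, "127.0.0.1", 80)

# A real resolver would attach an IOLoop callback to the future; here we
# just block for the result to keep the sketch short.
addrs = future.result()
pool.shutdown()
```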
 
4) Added logging on the target server with timing information to confirm that the request is not legitimately timing out (all requests seem to be returning within a reasonable amount of time).
5) Switched to using the cURL-backed client. This has been the most promising so far, it's cut down on the rate of incidents, but they're still happening.

This is very instructive, since the two implementations share little code.  Which version of libcurl are you using?  The 7.2x versions had quite a few bugs that could cause requests to go missing like this for a long time, although I haven't personally run into any since around 7.30.  
 

Based on my observations so far, my best guess is that on occasion, one of the requests is blocking indefinitely. I say this because all requests in the same process time out at the same time. I've searched for 599 errors and found several mentions on this list and on Stack Overflow, but so far none of those solutions have panned out (mostly the solution was to upgrade Tornado, but that was a while ago and I'm on a version past the ones which fixed other 599 problems).

Can anyone give me some advice on how to debug this problem further? It would be hugely helpful to know what exactly is blocking, but I'm not sure the best way to go about getting this information.

If something's blocking the IOLoop, set_blocking_log_threshold is the best way to track it down.  If an HTTP request is getting lost without actually blocking, that's much trickier.  

With curl_httpclient, you can log response.time_info on one of these long timed-out requests to see a little detail about where it spent its time.  You might also want to log len(client._curls), len(client._free_list), and len(client._requests) from time to time to confirm that free_list is becoming empty and _requests is filling up when this problem is occurring.  You can loop over client._curls and print curl.info['request'] and curl.info['curl_start_time'] to see what it's currently working on.  You can also enable debug logging, which will produce a ton of output that may or may not be useful.  
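(The counters above are private attributes of Tornado 3.x's CurlAsyncHTTPClient, so this is version-specific; a small hypothetical helper to log them from the periodic callback might look like:)

```python
def snapshot_curl_client(client):
    """Summarize curl_httpclient internals (private Tornado 3.x attributes;
    hypothetical debugging helper, not a public API)."""
    return {
        "curls": len(getattr(client, "_curls", ())),     # handles allocated
        "free": len(getattr(client, "_free_list", ())),  # idle handles
        "queued": len(getattr(client, "_requests", ())), # waiting requests
    }
```

Logged over time, an empty free list alongside a growing queue would confirm that all max_clients slots are stuck on hung requests.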

Increasing the max_clients option on AsyncHTTPClient.configure will probably help mitigate the problem.  There's little downside to increasing it from its default of 10, although it would be good to figure out why it's getting blocked in the first place.  
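(For example - the impl string and the value 50 here are illustrative; pass None as the first argument to keep the default simple_httpclient:)

```python
from tornado.httpclient import AsyncHTTPClient

# Raise the per-process cap on concurrent requests from its default of 10.
AsyncHTTPClient.configure(
    "tornado.curl_httpclient.CurlAsyncHTTPClient", max_clients=50)
```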

-Ben
 

I'm running Tornado v3.1.

 

Thanks for your help!
_Nik

--
You received this message because you are subscribed to the Google Groups "Tornado Web Server" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-tornad...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nikolas Stevenson-Molnar

Jun 18, 2014, 11:39:51 AM
to python-...@googlegroups.com
Hi Ben,

Thanks for your help! Responses inline below:


On 6/17/2014 8:54 PM, Ben Darnell wrote:

Sometimes this all works fine; everything behaves exactly as expected. But other times, it seems like one of the HTTP requests will block for a long time (15, 20 minutes) and eventually it and all other requests in that batch will time out and return 599 statuses. Here are the things I've looked at and the fixes I've tried:

Do all the others return their 599s at the same time or are they spread out?  After this has happened, do things return to normal or do you continue to see a lot of 599s?  When you say the request is blocking for 15-20m, is the server still responding to other requests as long as they don't need to make outbound http calls?
Yes, all the requests return 599 at the same time. I'm not using Tornado to serve requests; it's running in a Celery process. However, I have noticed that the periodic callback is not called while the request is blocking. Afterwards, things seem to work again until the next problem, but that's after the IOLoop has been restarted (the general workflow is: a user makes a request, the Django app starts a Celery process which starts a Tornado IOLoop to make some requests and then stops it until the next request comes in).

 
2) Ensured my periodic callback was not blocking. I added logging about its execution time and confirmed that in the worst case it would take a little more than a second to execute, and in most cases finished in under a tenth of a second.

You can use IOLoop.set_blocking_log_threshold to see if anything in the whole app is blocking.
Ok, I'll look into that. I haven't been able to get logging in Tornado working yet (I think it's conflicting with Django's logging configuration) but I'll dig into it some more and see what I can figure out.


3) Used IP addresses for the requests, thinking maybe the resolution was timing out. No effect.

DNS is a likely culprit in this scenario, since DNS resolution is blocking by default, so it's good to rule it out.
Yes, I saw a lot of mention of that. Using IP addresses should mean that part is skipped entirely, though, correct?


4) Added logging on the target server with timing information to confirm that the request is not legitimately timing out (all requests seem to be returning within a reasonable amount of time).
5) Switched to using the cURL-backed client. This has been the most promising so far, it's cut down on the rate of incidents, but they're still happening.

This is very instructive, since the two implementations share little code.  Which version of libcurl are you using?  The 7.2x versions had quite a few bugs that could cause requests to go missing like this for a long time, although I haven't personally run into any since around 7.30. 
I'm running libcurl 7.36.0 (the latest available via yum on the Amazon instance this is running on). I have read that the pre-built versions of libcurl use synchronous DNS resolution, but again, I would think that using IP addresses should get around that potential problem.


Based on my observations so far, my best guess is that on occasion, one of the requests is blocking indefinitely. I say this because all requests in the same process time out at the same time. I've searched for 599 errors and found several mentions on this list and on Stack Overflow, but so far none of those solutions have panned out (mostly the solution was to upgrade Tornado, but that was a while ago and I'm on a version past the ones which fixed other 599 problems).

Can anyone give me some advice on how to debug this problem further? It would be hugely helpful to know what exactly is blocking, but I'm not sure the best way to go about getting this information.

If something's blocking the IOLoop, set_blocking_log_threshold is the best way to track it down.  If an HTTP request is getting lost without actually blocking, that's much trickier. 
The fact that all outstanding requests return 599 at the same time and that the periodic callback stops in the interim would imply the former (blocking), correct?


With curl_httpclient, you can log response.time_info on one of these long timed-out requests to see a little detail about where it spent its time.  You might also want to log len(client._curls), len(client._free_list), and len(client._requests) from time to time to confirm that free_list is becoming empty and _requests is filling up when this problem is occurring.  You can loop over client._curls and print curl.info['request'] and curl.info['curl_start_time'] to see what it's currently working on.  You can also enable debug logging, which will produce a ton of output that may or may not be useful. 
Ok, I will try to gather some of this information. Thanks for pointing me to these; response.time_info seems like it could be particularly helpful. I'll let you know what I can find out.


Increasing the max_clients option on AsyncHTTPClient.configure will probably help mitigate the problem.  There's little downside to increasing it from its default of 10, although it would be good to figure out why it's getting blocked in the first place. 
This wouldn't help if the problem is blocking though, right? Because if one request blocks, all the others have to wait for it to finish.

_Nik

Ben Darnell

Jun 18, 2014, 11:46:13 PM
to Tornado Mailing List
On Wed, Jun 18, 2014 at 11:39 AM, Nikolas Stevenson-Molnar <nik.m...@consbio.org> wrote:
Hi Ben,

Thanks for your help! Responses inline below:


On 6/17/2014 8:54 PM, Ben Darnell wrote:

Sometimes this all works fine; everything behaves exactly as expected. But other times, it seems like one of the HTTP requests will block for a long time (15, 20 minutes) and eventually it and all other requests in that batch will time out and return 599 statuses. Here are the things I've looked at and the fixes I've tried:

Do all the others return their 599s at the same time or are they spread out?  After this has happened, do things return to normal or do you continue to see a lot of 599s?  When you say the request is blocking for 15-20m, is the server still responding to other requests as long as they don't need to make outbound http calls?
Yes, all the requests return 599 at the same time. I'm not using Tornado to serve requests; it's running in a Celery process. However, I have noticed that the periodic callback is not called while the request is blocking. Afterwards, things seem to work again until the next problem, but that's after the IOLoop has been restarted (the general workflow is: a user makes a request, the Django app starts a Celery process which starts a Tornado IOLoop to make some requests and then stops it until the next request comes in).

You're starting and stopping an IOLoop?  That's unusual and can be tricky to get right; when I integrate tornado with django I dedicate a thread to running the IOLoop continuously.  
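(The dedicated-thread pattern, sketched with a stdlib queue standing in for the IOLoop - a real version would call IOLoop.instance().start() in run() and forward add_callback to the loop; the class and method names are illustrative:)

```python
import queue
import threading

class CallbackLoopThread(threading.Thread):
    """Run callbacks serially on one long-lived thread, the way a dedicated
    IOLoop thread would (stdlib stand-in; illustrative, not Tornado code)."""

    def __init__(self):
        super().__init__(daemon=True)
        self._queue = queue.Queue()

    def add_callback(self, fn, *args):
        # Safe to call from any thread, like IOLoop.add_callback.
        self._queue.put((fn, args))

    def run(self):
        while True:
            fn, args = self._queue.get()
            if fn is None:  # sentinel enqueued by stop()
                return
            fn(*args)

    def stop(self):
        self._queue.put((None, ()))
```

The rest of the application only ever touches the thread through add_callback, which is exactly the discipline Ben describes for a real IOLoop thread.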
 

 
2) Ensured my periodic callback was not blocking. I added logging about its execution time and confirmed that in the worst case it would take a little more than a second to execute, and in most cases finished in under a tenth of a second.

You can use IOLoop.set_blocking_log_threshold to see if anything in the whole app is blocking.
Ok, I'll look into that. I haven't been able to get logging in Tornado working yet (I think it's conflicting with Django's logging configuration) but I'll dig into it some more and see what I can figure out.

Tornado logs to the standard python logging module, so it should just work.  One difference is that tornado sets the log level to INFO by default but the python basicConfig default is WARNING.  That shouldn't make any difference here though since blocking_log_threshold uses WARNING.
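(When Django's LOGGING configuration is in play, it may be enough to make sure Tornado's loggers actually have a handler attached; a minimal sketch:)

```python
import logging
import sys

# Tornado 3.x logs through these three standard-library loggers. If Django's
# LOGGING dict doesn't route them anywhere, attach a handler directly:
for name in ("tornado.access", "tornado.application", "tornado.general"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.StreamHandler(sys.stderr))
```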
 


3) Used IP addresses for the requests, thinking maybe the resolution was timing out. No effect.

DNS is a likely culprit in this scenario, since DNS resolution is blocking by default, so it's good to rule it out.
Yes, I saw a lot of mention of that. Using IP addresses should mean that part is skipped entirely, though, correct?

In theory.  It still calls getaddrinfo, which is supposed to recognize that it's looking at an IP address and return it right away.  However, getaddrinfo is a complex black box so it's possible it's doing something weird.
 
Based on my observations so far, my best guess is that on occasion, one of the requests is blocking indefinitely. I say this because all requests in the same process time out at the same time. I've searched for 599 errors and found several mentions on this list and on Stack Overflow, but so far none of those solutions have panned out (mostly the solution was to upgrade Tornado, but that was a while ago and I'm on a version past the ones which fixed other 599 problems).

Can anyone give me some advice on how to debug this problem further? It would be hugely helpful to know what exactly is blocking, but I'm not sure the best way to go about getting this information.

If something's blocking the IOLoop, set_blocking_log_threshold is the best way to track it down.  If an HTTP request is getting lost without actually blocking, that's much trickier. 
The fact that all outstanding requests return 599 at the same time and that the periodic callback stops in the interim would imply the former (blocking), correct?

Yes.
 
Increasing the max_clients option on AsyncHTTPClient.configure will probably help mitigate the problem.  There's little downside to increasing it from its default of 10, although it would be good to figure out why it's getting blocked in the first place. 
This wouldn't help if the problem is blocking though, right? Because if one request blocks, all the others have to wait for it to finish.

Correct.

-Ben

Nikolas Stevenson-Molnar

Jun 19, 2014, 12:12:36 PM
to python-...@googlegroups.com
On 6/18/2014 8:45 PM, Ben Darnell wrote:
> Yes, all the requests return 599 at the same time. I'm not using
> Tornado to serve requests, it's running in a Celery process. However,
> I have noticed that the periodic callback is not called while the
> request is blocking. Afterwards, things seem to work again until the
> next problem. However, this is after the ioloop is restarted (the
> general workflow is: user makes a request, the Django app starts a
> Celery process which starts a Tornado ioloop to make some requests and
> then stops until the next request comes in).
>
> You're starting and stopping an IOLoop? That's unusual and can be
> tricky to get right; when I integrate tornado with django I dedicate a
> thread to running the IOLoop continuously.
Ok. That could explain why it seems to take a bit for the problems to
start appearing after restarting the Celery workers. There are a couple
of reasons I set it up this way. One is that I'm using one worker per
target server and using the max_clients setting to avoid putting too
much load on that server. The other was just to avoid having to manage a
separate piece of architecture, though I could certainly change it if
it's causing problems.

Would it be better if I created new IOLoop instances rather than reusing
IOLoop.instance()?

I'm now logging out response.time_info, but it seems to be always empty
after getting a 599 (it does have information after other errors, like 404).

_Nik

Ben Darnell

Jun 19, 2014, 10:56:40 PM
to Tornado Mailing List
On Thu, Jun 19, 2014 at 12:12 PM, Nikolas Stevenson-Molnar <nik.m...@consbio.org> wrote:
On 6/18/2014 8:45 PM, Ben Darnell wrote:
> Yes, all the requests return 599 at the same time. I'm not using
> Tornado to serve requests, it's running in a Celery process. However,
> I have noticed that the periodic callback is not called while the
> request is blocking. Afterwards, things seem to work again until the
> next problem. However, this is after the ioloop is restarted (the
> general workflow is: user makes a request, the Django app starts a
> Celery process which starts a Tornado ioloop to make some requests and
> then stops until the next request comes in).
>
> You're starting and stopping an IOLoop?  That's unusual and can be
> tricky to get right; when I integrate tornado with django I dedicate a
> thread to running the IOLoop continuously.
Ok. That could explain why it seems to take a bit for the problems to
start appearing after restarting the Celery workers. There are a couple
of reasons I set it up this way. One is that I'm using one worker per
target server and using the max_clients setting to avoid putting too
much load on that server. The other was just to avoid having to manage a
separate piece of architecture, though I could certainly change it if
it's causing problems.

Would it be better if I created new IOLoop instances rather than reusing
IOLoop.instance()?

It's hard to say for sure without more details about your architecture (specifically as it relates to threading), but calling IOLoop.instance().start() more than once (except in situations like unit tests) is unlikely to be the best answer; it sounds like an invitation for threading-related problems. If you must start and stop IOLoops it's generally best to create a new one each time, like the synchronous HTTPClient does.  However, I would generally recommend creating a dedicated IOLoop thread (and only interacting with it via add_callback) so you only start the IOLoop once.

-Ben
 

I'm now logging out response.time_info, but it seems to be always empty
after getting a 599 (it does have information after other errors, like 404).

_Nik

Nikolas Stevenson-Molnar

Jun 24, 2014, 12:59:39 PM
to python-...@googlegroups.com

On 6/19/2014 7:56 PM, Ben Darnell wrote:

Ok. That could explain why it seems to take a bit for the problems to
start appearing after restarting the Celery workers. There are a couple
of reasons I set it up this way. One is that I'm using one worker per
target server and using the max_clients setting to avoid putting too
much load on that server. The other was just to avoid having to manage a
separate piece of architecture, though I could certainly change it if
it's causing problems.

Would it be better if I created new IOLoop instances rather than reusing
IOLoop.instance()?

It's hard to say for sure without more details about your architecture (specifically as it relates to threading), but calling IOLoop.instance().start() more than once (except in situations like unit tests) is unlikely to be the best answer; it sounds like an invitation for threading-related problems. If you must start and stop IOLoops it's generally best to create a new one each time, like the synchronous HTTPClient does.  However, I would generally recommend creating a dedicated IOLoop thread (and only interacting with it via add_callback) so you only start the IOLoop once.

-Ben
Ok, I'll try to explain the setup. There's no threading actually, just Celery workers. The workflow is something like this:
  1. The main web application calls a Celery task asynchronously to make a number of HTTP requests.
  2. The Celery task starts an IOLoop and begins making requests using the AsyncHTTPClient.
  3. Once all requests have completed, the IOLoop is stopped and the task returns so that the Celery worker is free to process other tasks.

I am now creating a new IOLoop (at step #2) each time this process happens, but unfortunately the problem persists. Given that, in your experience, does this seem like something that would be solved by re-designing this as a daemon which runs a single IOLoop indefinitely?
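(For reference, the "stop once all requests have completed" bookkeeping in steps 2-3 can be isolated into a small helper - hypothetical; BatchTracker and its method names are illustrative:)

```python
class BatchTracker:
    """Collect responses for a known-size batch and invoke on_done exactly
    once when the last one arrives (e.g. on_done=IOLoop.instance().stop)."""

    def __init__(self, total, on_done):
        self.pending = total
        self.responses = []
        self._on_done = on_done

    def handle_response(self, response):
        # Pass this method as the callback for each AsyncHTTPClient.fetch().
        self.responses.append(response)
        self.pending -= 1
        if self.pending == 0:
            self._on_done()
```

Note that a 599 still counts as a completed request here, which matches the observed behavior of the whole batch finishing (with errors) at once.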

_Nik

Ben Darnell

Jun 24, 2014, 5:25:42 PM
to Tornado Mailing List