Hi everyone,

I've been getting intermittent 599 errors using AsyncHTTPClient. I've tried a number of solutions and have not yet been able to resolve the issue, so I thought I'd see if anyone here could offer me some clues.

The problem has been hard to get a handle on because it's inconsistent and so far only presents in our production environment (I've not been able to reproduce it on my local machine or on our staging server). I have a web application which uses Django and Celery. I start a Celery task in response to some user action and in that task make a number of requests using AsyncHTTPClient. The exact number of requests varies, but it's typically no more than a few dozen. I also have a periodic callback which checks for potential additional requests which need to be made.

Sometimes this all works fine; everything behaves exactly as expected. But other times, it seems like one of the HTTP requests blocks for a long time (15 or 20 minutes), and eventually it and all the other requests in that batch time out and return 599 statuses. Here are the things I've looked at and fixes I've tried:
1) Raised the file limit on the server (thinking perhaps the client had to wait on available sockets). No effect.
2) Ensured my periodic callback was not blocking. I added logging of its execution time and confirmed that in the worst case it took a little more than a second to execute, and in most cases finished in under a tenth of a second.
3) Used IP addresses for the requests, thinking maybe the resolution was timing out. No effect.
4) Added logging on the target server with timing information to confirm that the request is not legitimately timing out (all requests seem to be returning within a reasonable amount of time).
5) Switched to using the cURL-backed client. This has been the most promising so far: it's cut down on the rate of incidents, but they're still happening.
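The timing check in (2) can be sketched with a small stdlib decorator; the callback name and the 0.1 s threshold below are placeholders, not the real code:

```python
# Sketch of the instrumentation described in (2): wrap the periodic
# callback so every run records and logs its own wall-clock duration.
# check_for_new_requests and the 0.1s threshold are illustrative.
import functools
import logging
import time

def timed(fn, threshold=0.1):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.last_elapsed = time.time() - start
            if wrapper.last_elapsed > threshold:
                logging.warning("%s took %.3fs", fn.__name__,
                                wrapper.last_elapsed)
    return wrapper

@timed
def check_for_new_requests():
    pass  # placeholder for the real periodic-callback body
```

The `last_elapsed` attribute makes the most recent duration available to other diagnostics as well as the log.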
Based on my observations so far, my best guess is that on occasion one of the requests is blocking indefinitely. I say this because all requests in the same process time out at the same time. I've searched for 599 errors and found several mentions on this list and on Stack Overflow, but so far none of those solutions have panned out (mostly the solution was to upgrade Tornado, but that was a while ago and I'm on a version past the ones which fixed other 599 problems).

Can anyone give me some advice on how to debug this problem further? It would be hugely helpful to know what exactly is blocking, but I'm not sure of the best way to go about getting this information.
I'm running Tornado v3.1.
Thanks for your help!

_Nik
--
You received this message because you are subscribed to the Google Groups "Tornado Web Server" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-tornad...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
> Sometimes this all works fine; everything behaves exactly as expected. But other times, it seems like one of the HTTP requests blocks for a long time (15 or 20 minutes), and eventually it and all the other requests in that batch time out and return 599 statuses. Here are the things I've looked at and fixes I've tried:
Do all the others return their 599s at the same time or are they spread out? After this has happened, do things return to normal or do you continue to see a lot of 599s? When you say the request is blocking for 15-20m, is the server still responding to other requests as long as they don't need to make outbound http calls?
> 2) Ensured my periodic callback was not blocking. I added logging of its execution time and confirmed that in the worst case it took a little more than a second to execute, and in most cases finished in under a tenth of a second.
You can use IOLoop.set_blocking_log_threshold to see if anything in the whole app is blocking.
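For reference, set_blocking_log_threshold is built on SIGALRM. The idea can be sketched standalone like this; note this is an illustration of the mechanism, not Tornado's implementation, and like the real method it only works on the main thread of a Unix process:

```python
# Illustration of the SIGALRM watchdog idea behind
# IOLoop.set_blocking_log_threshold (not Tornado's actual code): if one
# stretch of work runs longer than the threshold, the signal handler
# fires and logs a stack trace showing exactly where the process is stuck.
import logging
import signal
import time
import traceback

THRESHOLD = 1.0  # seconds; Tornado takes this as the method's argument

def _log_blocked(signum, frame):
    logging.error("Blocked for more than %.1fs:\n%s", THRESHOLD,
                  "".join(traceback.format_stack(frame)))

signal.signal(signal.SIGALRM, _log_blocked)

def run_watched(callback):
    """Arm the alarm around a single callback, disarm when it returns."""
    signal.setitimer(signal.ITIMER_REAL, THRESHOLD)
    try:
        return callback()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm

run_watched(lambda: time.sleep(1.5))  # handler fires and logs after ~1s
```

In the application itself the single call `IOLoop.instance().set_blocking_log_threshold(1)` (threshold in seconds) is all that's needed.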
> 3) Used IP addresses for the requests, thinking maybe the resolution was timing out. No effect.
DNS is a likely culprit in this scenario, since DNS resolution is blocking by default, so it's good to rule it out.
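One way to rule DNS out without rewriting URLs to IP addresses is to push resolution onto worker threads; Tornado's `Resolver.configure('tornado.netutil.ThreadedResolver')` does essentially this. A stdlib-only sketch of the idea:

```python
# Sketch: resolve hostnames on a thread pool so a slow or wedged DNS
# server cannot stall the event-loop thread. Tornado's ThreadedResolver
# applies the same idea inside the IOLoop; pool size here is arbitrary.
import socket
from concurrent.futures import ThreadPoolExecutor

_resolver_pool = ThreadPoolExecutor(max_workers=4)

def resolve_async(host, port):
    """Return a Future for getaddrinfo() instead of blocking the caller."""
    return _resolver_pool.submit(socket.getaddrinfo, host, port)

# The caller waits on the Future (or attaches a callback) while the
# event loop stays responsive to everything else.
addrs = resolve_async("localhost", 80).result(timeout=10)
```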
> 4) Added logging on the target server with timing information to confirm that the request is not legitimately timing out (all requests seem to be returning within a reasonable amount of time).
> 5) Switched to using the cURL-backed client. This has been the most promising so far: it's cut down on the rate of incidents, but they're still happening.
This is very instructive, since the two implementations share little code. Which version of libcurl are you using? The 7.2x versions had quite a few bugs that could cause requests to go missing like this for a long time, although I haven't personally run into any since around 7.30.
> Based on my observations so far, my best guess is that on occasion one of the requests is blocking indefinitely. I say this because all requests in the same process time out at the same time. I've searched for 599 errors and found several mentions on this list and on Stack Overflow, but so far none of those solutions have panned out (mostly the solution was to upgrade Tornado, but that was a while ago and I'm on a version past the ones which fixed other 599 problems).
>
> Can anyone give me some advice on how to debug this problem further? It would be hugely helpful to know what exactly is blocking, but I'm not sure of the best way to go about getting this information.
If something's blocking the IOLoop, set_blocking_log_threshold is the best way to track it down. If an HTTP request is getting lost without actually blocking, that's much trickier.
With curl_httpclient, you can log response.time_info on one of these long timed-out requests to see a little detail about where it spent its time. You might also want to log len(client._curls), len(client._free_list), and len(client._requests) from time to time to confirm that free_list is becoming empty and _requests is filling up when this problem is occurring. You can loop over client._curls and print curl.info['request'] and curl.info['curl_start_time'] to see what it's currently working on. You can also enable debug logging, which will produce a ton of output that may or may not be useful.
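Those suggestions could be wired up roughly like this; `_curls`, `_free_list`, and `_requests` are private CurlAsyncHTTPClient attributes in Tornado 3.x, so treat this as throwaway diagnostic code that may break across releases:

```python
# Debugging helpers for the curl client's internal state. The
# underscore-prefixed attributes are Tornado 3.x internals.
import logging

def client_state(client):
    """Counts of (total handles, free handles, queued/active requests)."""
    return (len(client._curls), len(client._free_list),
            len(client._requests))

def log_client_state(client):
    # _free_list emptying while _requests grows means requests are piling
    # up behind stuck handles.
    logging.info("handles=%d free=%d pending=%d", *client_state(client))

def log_in_flight(client):
    """Show what each busy curl handle is currently working on."""
    for curl in client._curls:
        info = getattr(curl, "info", None) or {}
        if info.get("request") is not None:
            logging.info("in flight: %r since %r",
                         info.get("request"), info.get("curl_start_time"))

def log_599(response):
    """On timeout, dump curl's per-phase timings (dns, connect, total...)."""
    if response.code == 599:
        logging.warning("599, request_time=%r, time_info=%r",
                        response.request_time, response.time_info)
```

Calling `log_client_state` from the existing periodic callback is a cheap way to watch the pool over time.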
Increasing the max_clients option on AsyncHTTPClient.configure will probably help mitigate the problem. There's little downside to increasing it from its default of 10, although it would be good to figure out why it's getting blocked in the first place.
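Raising the cap is a one-line configuration change; the value 50 below is only an illustration:

```python
# Raise the cap on concurrent requests per AsyncHTTPClient instance
# from its default of 10. The first argument selects the implementation
# (None keeps the default; pass
# "tornado.curl_httpclient.CurlAsyncHTTPClient" for the curl backend).
from tornado.httpclient import AsyncHTTPClient

AsyncHTTPClient.configure(None, max_clients=50)
```

configure() must run before the first client instance is created, since instances are cached per IOLoop.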
Hi Ben,
Thanks for your help! Responses inline below.

Yes, all the requests return 599 at the same time. I'm not using Tornado to serve requests; it's running in a Celery process. However, I have noticed that the periodic callback is not called while the request is blocking. Afterwards, things seem to work again until the next problem. However, this is after the IOLoop is restarted (the general workflow is: a user makes a request, the Django app starts a Celery process which starts a Tornado IOLoop to make some requests and then stops until the next request comes in).
On 6/17/2014 8:54 PM, Ben Darnell wrote:
>> Sometimes this all works fine; everything behaves exactly as expected. But other times, it seems like one of the HTTP requests blocks for a long time (15 or 20 minutes), and eventually it and all the other requests in that batch time out and return 599 statuses. Here are the things I've looked at and fixes I've tried:
>
> Do all the others return their 599s at the same time or are they spread out? After this has happened, do things return to normal or do you continue to see a lot of 599s? When you say the request is blocking for 15-20 minutes, is the server still responding to other requests as long as they don't need to make outbound HTTP calls?
Ok, I'll look into that. I haven't been able to get logging in Tornado working yet (I think it's conflicting with Django's logging configuration) but I'll dig into it some more and see what I can figure out.
>> 2) Ensured my periodic callback was not blocking. I added logging of its execution time and confirmed that in the worst case it took a little more than a second to execute, and in most cases finished in under a tenth of a second.
>
> You can use IOLoop.set_blocking_log_threshold to see if anything in the whole app is blocking.
Yes, I saw a lot of mention of that. Using IP addresses should mean that part is skipped entirely, though, correct?
>> 3) Used IP addresses for the requests, thinking maybe the resolution was timing out. No effect.
>
> DNS is a likely culprit in this scenario, since DNS resolution is blocking by default, so it's good to rule it out.
The fact that all outstanding requests return 599 at the same time and that the periodic callback stops in the interim would imply the former (blocking), correct?

>> Based on my observations so far, my best guess is that on occasion one of the requests is blocking indefinitely. I say this because all requests in the same process time out at the same time. I've searched for 599 errors and found several mentions on this list and on Stack Overflow, but so far none of those solutions have panned out (mostly the solution was to upgrade Tornado, but that was a while ago and I'm on a version past the ones which fixed other 599 problems).
>> Can anyone give me some advice on how to debug this problem further? It would be hugely helpful to know what exactly is blocking, but I'm not sure of the best way to go about getting this information.
>
> If something's blocking the IOLoop, set_blocking_log_threshold is the best way to track it down. If an HTTP request is getting lost without actually blocking, that's much trickier.
This wouldn't help if the problem is blocking, though, right? Because if one request blocks, all the others have to wait for it to finish.

> Increasing the max_clients option on AsyncHTTPClient.configure will probably help mitigate the problem. There's little downside to increasing it from its default of 10, although it would be good to figure out why it's getting blocked in the first place.
On 6/18/2014 8:45 PM, Ben Darnell wrote:

>> Yes, all the requests return 599 at the same time. I'm not using
>> Tornado to serve requests; it's running in a Celery process. However,
>> I have noticed that the periodic callback is not called while the
>> request is blocking. Afterwards, things seem to work again until the
>> next problem. However, this is after the IOLoop is restarted (the
>> general workflow is: a user makes a request, the Django app starts a
>> Celery process which starts a Tornado IOLoop to make some requests and
>> then stops until the next request comes in).
>
> You're starting and stopping an IOLoop? That's unusual and can be
> tricky to get right; when I integrate Tornado with Django I dedicate a
> thread to running the IOLoop continuously.

Ok. That could explain why it seems to take a bit for the problems to
start appearing after restarting the Celery workers. There are a couple
of reasons I set it up this way. One is that I'm using one worker per
target server and using the max_clients setting to avoid putting too
much load on that server. The other was just to avoid having to manage a
separate piece of architecture, though I could certainly change it if
it's causing problems.
Would it be better if I created new IOLoop instances rather than reusing
IOLoop.instance()?
I'm now logging response.time_info, but it seems to be always empty
after getting a 599 (it does have information after other errors, like 404).
_Nik
> Ok. That could explain why it seems to take a bit for the problems to
> start appearing after restarting the Celery workers. There are a couple
> of reasons I set it up this way. One is that I'm using one worker per
> target server and using the max_clients setting to avoid putting too
> much load on that server. The other was just to avoid having to manage a
> separate piece of architecture, though I could certainly change it if
> it's causing problems.
>
> Would it be better if I created new IOLoop instances rather than reusing
> IOLoop.instance()?
It's hard to say for sure without more details about your architecture (specifically as it relates to threading), but calling IOLoop.instance().start() more than once (except in situations like unit tests) is unlikely to be the best answer; it sounds like an invitation for threading-related problems. If you must start and stop IOLoops it's generally best to create a new one each time, like the synchronous HTTPClient does. However, I would generally recommend creating a dedicated IOLoop thread (and only interacting with it via add_callback) so you only start the IOLoop once.
-Ben
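The dedicated-thread design might be sketched like this; `run_on_loop` is an illustrative helper, not Tornado API, and add_callback is the one IOLoop method that is safe to call from other threads:

```python
# Sketch of the dedicated-IOLoop-thread design: the loop is started
# exactly once and runs for the life of the process; worker threads
# hand it work via add_callback, the IOLoop's thread-safe entry point.
# run_on_loop is an illustrative helper, not part of Tornado.
import threading
from tornado.ioloop import IOLoop

io_loop = IOLoop()  # a private loop; IOLoop.instance() is never restarted
_thread = threading.Thread(target=io_loop.start)
_thread.daemon = True
_thread.start()

def run_on_loop(fn, *args):
    """Schedule fn(*args) on the IOLoop thread from any other thread."""
    io_loop.add_callback(fn, *args)
```

With this in place a Celery task never touches the loop directly; it just calls `run_on_loop(...)` and collects results through a thread-safe channel such as a queue.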
I am now creating a new IOLoop (at step #2) each time this
process happens, but unfortunately the problem persists. Given
that, in your experience, does this seem like something that would
be solved by re-designing this as a daemon which runs a single
IOLoop indefinitely?
_Nik