Keepalive and timeout problems with Lua, Redis and Ledge

Kim Brandt

Dec 7, 2015, 1:29:28 PM
to openresty-en
Hello, I've been using OpenResty together with Ledge (https://github.com/pintsized/ledge) for a while, and I've been struggling with timeout and keepalive error/warn messages in my Nginx error.log.

For example:
[error] 173#0: *48089 [lua] ledge.lua:2340: fetch_from_origin(): timeout, client: x.x.x.x, server: _, request: "GET /request HTTP/1.1", host: "www.example.com"
[warn] 173#0: *48089 [lua] ledge.lua:444: couldn't set keepalive, socket busy reading, client: x.x.x.x, server: _, request: "GET /request HTTP/1.1", host: "www.example.com"

I would be grateful if someone could point me in the right direction of where to begin looking.

Thanks.

Yichun Zhang (agentzh)

Dec 7, 2015, 1:35:16 PM
to openresty-en
Hello!

On Mon, Dec 7, 2015 at 10:29 AM, Kim Brandt wrote:
> For example:
> [error] 173#0: *48089 [lua] ledge.lua:2340: fetch_from_origin(): timeout,
> client: x.x.x.x, server: _, request: "GET /request HTTP/1.1", host:
> "www.example.com"

Debugging timeouts requires advanced (dynamic tracing) tools to be sure.

Common causes are too-short timeout threshold settings on the client
side, or backend servers that are (intermittently) too busy and slow.
Sometimes, badly blocking operations in the nginx server (like slow
disk operations, too-intensive CPU computations, or even bad Lua
libraries doing blocking I/O) can also lead to timeout errors, in
which case you can use the off-CPU time flame graph tool to diagnose.
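To rule out an overly aggressive client-side threshold, one thing worth checking is the per-phase timeouts on the cosocket talking to the origin. A hypothetical sketch (host, port, and values are made-up examples, not Ledge's actual code):

```lua
-- Illustrative only: generous connect/send/read timeouts on a raw
-- cosocket, to rule out a too-short client-side timeout threshold.
local sock = ngx.socket.tcp()
-- connect, send, and read timeouts, in milliseconds
sock:settimeouts(5000, 10000, 30000)
local ok, err = sock:connect("origin.example.com", 80)
if not ok then
    ngx.log(ngx.ERR, "connect to origin failed: ", err)
    return
end
```

If raising the read timeout makes the errors go away, the backend is genuinely slow and the flame graph tools can tell you where the time goes.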

> [warn] 173#0: *48089 [lua] ledge.lua:444: couldn't set keepalive, socket
> busy reading, client: x.x.x.x, server: _, request: "GET /request HTTP/1.1",
> host: "www.example.com"
>

This looks like a user programming mistake in Lua land. Basically
one should never call setkeepalive() on a cosocket from one "light
thread" while another "light thread" is busy reading from the same
cosocket (for obvious reasons). You should fix that in your own Lua
code (or in someone else's Lua code you are using).
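A hypothetical illustration of the race described above (assumes an already-connected cosocket `sock`; not taken from Ledge's code):

```lua
-- BROKEN pattern: a background light thread is still reading from
-- `sock` when the main thread tries to pool the connection.
local reader = ngx.thread.spawn(function()
    local line, err = sock:receive("*l")  -- read may still be in flight
end)

-- Calling setkeepalive() here would fail with "socket busy reading":
-- local ok, err = sock:setkeepalive(10000, 64)

-- FIX: wait for the reader thread to finish before pooling the socket.
ngx.thread.wait(reader)
local ok, err = sock:setkeepalive(10000, 64)
if not ok then
    ngx.log(ngx.WARN, "failed to set keepalive: ", err)
end
```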

Best regards,
-agentzh

Kim Brandt

Dec 8, 2015, 8:49:51 AM
to openresty-en
Thanks for the quick reply agentzh.

I think I have narrowed it down to be a proxy issue.
We're using Amazon Beanstalk with Docker, and on the host instance there is a standard Nginx proxying all incoming requests on port 80 into the Docker container.
When I've debugged using only the Docker container, with no proxy, there seems to be no problem with keepalive warnings.
Would you know if there are some parameters or headers that we should pass on to the Docker container's OpenResty?
The keepalive setting for the upstream is set to 256, and increasing that value doesn't seem to have any effect.
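For what it's worth, for the `keepalive` directive in an nginx `upstream` block to actually reuse connections, the proxy usually also needs to speak HTTP/1.1 to the upstream and clear the `Connection` header. A minimal sketch (upstream name and addresses are illustrative):

```nginx
upstream docker_app {
    server 127.0.0.1:8080;   # illustrative container address
    keepalive 256;           # idle keepalive connections per worker
}

server {
    listen 80;

    location / {
        proxy_pass http://docker_app;
        proxy_http_version 1.1;         # keepalive requires HTTP/1.1
        proxy_set_header Connection ""; # strip "Connection: close"
    }
}
```

Without the last two directives, nginx proxies with HTTP/1.0 and `Connection: close` by default, so the upstream keepalive pool is never used regardless of its size.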

James Hurst

Dec 8, 2015, 11:56:33 AM
to openre...@googlegroups.com
Hi Kim,

In case it's not obvious, those two messages are for different reasons. The first as you've seen is an issue with the backend timing out. You should have seen Ledge returning 524 to the client in this case. It's a normal error if the backend is too slow - you could increase the timeout, or try to figure out why it's so slow behind that proxy. Buffering perhaps?

The second is a warning after a failure to place the Redis connection in the keepalive pool, because there's still data to be read. Since under normal circumstances this only occurs when everything is finished, my guess is that this is after a *client* timeout, so Ledge abandons the work it has been doing and tries to clean up as best it can, but can't do anything with this particular Redis connection. You should see the client timeout in your logs too, which would confirm this. We do try to restore the connection to a "NORMAL" state with the DISCARD command, in case it's mid-transaction, but perhaps that's not always enough.
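A sketch of the kind of cleanup being described here (not Ledge's actual code; assumes a lua-resty-redis connection object `red`):

```lua
-- Hypothetical cleanup after an abandoned request: abort any
-- half-finished MULTI transaction, then try to pool the connection,
-- falling back to closing it outright.
local function put_back(red)
    local ok, err = red:discard()
    if not ok then
        -- DISCARD errors when no transaction is open; harmless here.
        ngx.log(ngx.DEBUG, "discard: ", err)
    end

    local ok, err = red:set_keepalive(10000, 64)
    if not ok then
        ngx.log(ngx.WARN, "couldn't set keepalive: ", err)
        red:close()
    end
end
```

If there is still unread data on the socket (the "socket busy reading" case), `set_keepalive` will refuse, and closing the connection is the only safe option.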

I'm not sure how else to handle that situation (if my guess is right). Essentially we're abandoning the current request because the client has aborted, I guess the reusability of the Redis connection might be down to luck. Though to be honest, I'm not sure how this would actually happen with data still to be read from the socket. It is a bit odd. Do you have collapsed forwarding enabled at all?

Let me know if you find any more details and I'll take a closer look.

Cheers,

James.


Kim Brandt

Dec 9, 2015, 10:21:37 AM
to openresty-en
Hello James,

Yes, I know the difference between the two log lines.
The main and biggest reason for asking for help here was the 524 errors that occasionally made our environment very unstable.

I have tested throughout the day with different approaches, and come up with some changes that do work better, both in development and in the Amazon testing environment.
  1. Increased the worker_rlimit_nofile.
  2. Enabled "enable_collapsed_forwarding" in Ledge.
  3. Only run our authentication Lua script where it's needed. (i.e. disable it on static files.)
With those changes, I can't trigger the timeout errors I had before.
Though in the Amazon environment the keepalive warnings still appear, and the only difference between the development and Amazon environments is the Nginx proxy that Amazon provides.
The next step will be to adjust those settings.
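For reference, changes 1 and 2 above might look roughly like this in an OpenResty setup (the file-limit value is illustrative, and the exact Ledge initialization/config API should be checked against the Ledge README for your version):

```nginx
# nginx.conf, main context: raise the per-worker open-file limit.
worker_rlimit_nofile 65535;   # illustrative value

http {
    init_by_lua_block {
        -- Hypothetical Ledge setup; API names hedged, see Ledge docs.
        ledge = require("ledge.ledge"):new()
        ledge:config_set("enable_collapsed_forwarding", true)
    }
}
```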