lua tcp socket read timed out


Ron Schwartz

Dec 10, 2013, 4:23:16 AM12/10/13
to openre...@googlegroups.com
Hello,
We have been running OpenResty with the redis client in production for quite a long time. In the last few days we started getting "lua tcp socket read timed out" errors.

This is our code:

local redis = require "resty.redis"

local function logErrorAndExit(err)
    ngx.log(ngx.ERR, err)
    return ngx.exec('/_.gif')
end

local red = redis:new()
red:set_timeout(3000) -- 3 sec

local ok, err = red:connect("127.0.0.1", 6379)
if not ok then return logErrorAndExit("Error connecting to redis: " .. err) end

ok, err = red:evalsha(<some_hash>, 1, "args", args_json)
if not ok then return logErrorAndExit("Error evaluating redis script: " .. err) end

ok, err = red:set_keepalive(10000, 100)
if not ok then return logErrorAndExit("Error setting redis keepalive: " .. err) end

________________________________________________________

From inspecting the error log, it doesn't seem to happen in the calls to red:connect(), red:evalsha(), or red:set_keepalive() (because I don't see the custom error messages described above).

Any idea what can cause it?

Thanks,
Ron



Yichun Zhang (agentzh)

Dec 10, 2013, 10:40:46 AM12/10/13
to openresty-en
Hello!

On Tue, Dec 10, 2013 at 1:23 AM, Ron Schwartz wrote:
> We have been running OpenResty with the redis client in production for quite
> a long time. In the last few days we started getting "lua tcp socket read
> timed out" errors.
>

If the Lua code snippet you provided is the only place that uses ngx_lua's
cosocket API, then only evalsha() can yield this "read timed out" error;
neither connect() nor set_keepalive() would.

> From inspecting the error log, it doesn't seem to happen in the calls to
> red:connect(), red:evalsha(), or red:set_keepalive() (because I don't see
> the custom error messages described above)
>

This is weird. Are you sure you don't have any other Lua code doing
cosockets? Also, you can emulate a redis read timeout in your redis Lua
script and test whether your Lua-land error-handling code indeed works.
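One quick way to exercise that error-handling path without slowing redis down (a sketch against the handler posted above, not agentzh's exact suggestion: it shrinks the client-side timeout instead of slowing the redis script; some_hash and args_json stand in for your real values) is:

```lua
-- Sketch: force the "read timed out" path by setting a 1 ms timeout,
-- far below any realistic redis round trip, so evalsha() almost
-- certainly times out and the custom error message should be logged.
local redis = require "resty.redis"
local red = redis:new()

red:set_timeout(1) -- 1 ms, instead of the usual 3000

local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
    return logErrorAndExit("Error connecting to redis: " .. err)
end

ok, err = red:evalsha(some_hash, 1, "args", args_json)
if not ok then
    -- err should be "timeout" here; check the error log for the message
    return logErrorAndExit("Error evaluating redis script: " .. err)
end
```

If the custom "Error evaluating redis script: timeout" line shows up in the error log under this forced condition, the handler's error path works and the stray "lua tcp socket read timed out" lines must come from somewhere else.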

One way to debug this further is to write a SystemTap- or DTrace-based
tool to trace this "read timed out" error in production and dump out
the Lua-land backtraces at those points on the fly.

One probing point candidate is the C function
ngx_http_lua_socket_tcp_receive_retval_handler in nginx. Check out the
stap++ project for some similar examples:

https://github.com/agentzh/stapxx

Regards,
-agentzh

Ron Schwartz

Dec 19, 2013, 10:18:43 AM12/19/13
to openre...@googlegroups.com
Thanks for your answer! 
I have some more information on this problem.
I'm using ApacheBench (ab) for load testing in our testing environment, and every time I hit the OpenResty server with more than 100 concurrent connections (ab -n 2000 -c 100 http://our_server) I get "lua tcp socket read timed out" on the red:evalsha(...) command.

I've checked the redis slowlog and it seems the redis operation takes around 0.01 seconds.
Before I connect to redis I set the timeout to 3 seconds (red:set_timeout(3000)).

I also call set_keepalive at the end of the redis operation (ok, err = red:set_keepalive(10000, 100)).

After raising the connection timeout to 10 seconds (red:set_timeout(10000)), it appears that there are no "lua tcp socket read timed out" errors with 100 concurrent connections, but when I try 200 concurrent connections, they are back.


*** Any idea what I can do to increase the number of concurrent connections our system can handle?
*** If I use a unix domain socket to connect to redis, will I be able to support more concurrent connections?

Thanks,
Ron

 ab -n 2000 -c 100 'http://54.214.48.186/reads'

Yichun Zhang (agentzh)

Dec 19, 2013, 2:32:48 PM12/19/13
to openresty-en
Hello!

On Thu, Dec 19, 2013 at 7:18 AM, Ron Schwartz wrote:
> I have some more information on this problem.
> I'm using ApacheBench (ab) for load testing in our testing environment, and
> every time I hit the OpenResty server with more than 100 concurrent
> connections (ab -n 2000 -c 100 http://our_server) I get "lua tcp socket read
> timed out" on the red:evalsha(...) command.
>

The problem with tools like ab is that they always try to send
requests on all the concurrent connections as fast as possible,
driving the server to its throughput limit (or request rate limit).

When you're at the throughput limit, increasing the concurrency level
will surely sacrifice request latency. This is a natural consequence;
otherwise you'd break basic math (by Little's law, latency ≈
concurrency / throughput).

Try using tools that can generate a constant request rate well below
the throughput limit when experimenting with different concurrency
levels.
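For instance (a sketch, not something from this thread: wrk2 is one such constant-rate tool, and the URL is the one from the ab run above):

```shell
# wrk2 (https://github.com/giltene/wrk2) holds a fixed request rate via -R,
# unlike ab, which always pushes the server to its throughput limit.
# Here: 100 connections, a constant 500 req/sec for 30s, with latency stats.
wrk -t2 -c100 -d30s -R500 --latency http://our_server/reads
```

Sweeping -R upward while watching the reported latency distribution shows where the knee of the curve is, independently of the concurrency level.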

> I've checked the redis slowlog and it seems the redis operation takes
> around 0.01 seconds.

10ms is already a very long latency for redis. Remember that each
redis server is single-threaded, so all requests have to queue up
inside the redis server for processing, and the queuing time grows
dramatically when the request rate is high. Try optimizing your
redis-side Lua script to reduce the processing time within redis.
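The back-of-envelope math here (a sketch, assuming the 10 ms slowlog figure is the full per-request service time) shows why the timeouts track the concurrency level:

```lua
-- Back-of-envelope queueing math for a single-threaded redis server,
-- assuming each request takes the 10 ms observed in the slowlog.
local service_time = 0.010                 -- seconds per request
local max_throughput = 1 / service_time    -- throughput ceiling: ~100 req/sec

-- With N requests queued at once, the last one waits behind all the others:
local function worst_case_latency(n)
  return n * service_time
end

print(("ceiling: %.0f req/sec"):format(max_throughput))
print(("100 in flight -> %.1f s tail latency"):format(worst_case_latency(100)))
print(("200 in flight -> %.1f s tail latency"):format(worst_case_latency(200)))
```

So with ab keeping 100-200 requests in flight against a ~100 req/sec ceiling, tail waits of 1-2 s are already expected, and any extra work per request pushes the tail past a 3 s timeout.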

> After raising the connection timeout to 10 seconds
> (red:set_timeout(10000)), it appears that there are no "lua tcp socket read
> timed out" errors with 100 concurrent connections, but when I try 200
> concurrent connections, they are back.
>

Requests can queue up both in the individual nginx worker processes
and in the redis server process. If you're on Linux, you can use tools
like tcp-recv-queue and epoll-loop-blocking-distr on both sides to
measure the queuing effects:

https://github.com/agentzh/nginx-systemtap-toolkit#tcp-recv-queue
https://github.com/agentzh/stapxx#epoll-loop-blocking-distr

You can also generate the on-CPU and off-CPU flamegraphs for both your
nginx workers and redis servers under load to look for spots that can
be optimized to raise the throughput limit:

https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt
https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt-off-cpu
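Roughly following the toolkit's README, the on-CPU flamegraph workflow looks like this (a sketch: 8736 is a hypothetical nginx worker pid, and the script paths depend on where you cloned nginx-systemtap-toolkit and Brendan Gregg's FlameGraph repo):

```shell
# Sample user-space backtraces of one nginx worker for 5 seconds
./sample-bt -p 8736 -t 5 -u > a.bt

# Fold the stacks and render the SVG (scripts from the FlameGraph repo)
stackcollapse-stap.pl a.bt > a.cbt
flamegraph.pl a.cbt > a.svg
```

The same sampling can be pointed at the redis server pid, and sample-bt-off-cpu follows the same pattern for the off-CPU view.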

For high-performance web apps, the throughput limit is everything.
Once you exceed the throughput limit, request latency immediately
suffers, and raising the concurrency level only makes it worse. This
is even more pronounced with tools like ab that always push the
server right to its throughput limit.

Best regards,
-agentzh