Hello!
On Tue, Feb 12, 2013 at 5:17 AM, NinjaPenguin wrote:
> Hi Agentzh
>
Please do not capitalize my nick. Thank you.
> Firstly thanks very much for getting back to me!
>
> The storage size is 100M (I dramatically highballed it to ensure I had
> space). Testing today with the added call to flush_expired did indeed seem
> to remove the "ngx_slab_alloc() failed: no memory" msg - so thanks very much
> for that!
>
Good to know :)
> I am still seeing the issue with calls essentially being lost (and not being
> submitted to Gearman) - the more I think about this though the more I
> believe it is due to the lack of atomicity within the shared dict. I believe
> in the time between making a submission to Gearman and calling flush, other
> processes are probably writing to the space and so they are then
> subsequently flushed without having been written.
>
Atomicity is only guaranteed at the method-call level. That is, "get"
is atomic and "set" is atomic, but a sequence of "get" followed by
"set" is not.
If you want to lock a sequence of calls, you have to emulate a
high-level lock yourself as discussed here:
https://groups.google.com/group/openresty-en/browse_thread/thread/4c91de9fc25dd2d7/6fdf04d24f12443f
Maybe we can eventually implement a built-in transaction API in shared
dict, like the one in Redis :)
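Just to give the idea, here is a minimal sketch of such a lock. It
relies on the fact that add() is atomic and only succeeds when the key
does not exist yet. The dict name "my_dict", the key "my_lock", and
the 5-second expiry are all made up; the expiry is just a safety net
so that a crashed worker cannot hold the lock forever:

    local dict = ngx.shared.my_dict

    -- try to grab the lock; add() fails if the key already exists
    local ok, err = dict:add("my_lock", true, 5)
    if not ok then
        -- someone else holds the lock; give up here (or retry later)
        return
    end

    -- critical section: do your grouped shared dict reads and writes
    -- here, but no I/O (no cosockets, no subrequests) while locked

    dict:delete("my_lock")  -- release the lock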
> It's possible that I could work around this with a basic lock method using
> the :get call, but I'm undecided on how exactly this would work, and the
> performance impact it may have
>
See above. Also, group your shared dict operations together and do not
do any I/O in the middle of the locked section.
> For now I have simply removed the chunking of these jobs and now submit on
> each request
>
> This did however reveal a subsequent issue with connections to redis (I'm
> using the redis2 module) and the current upstream configuration:
>
> upstream redis {
> server unix:/redis-6406/redis.sock;
>
> # a pool with at most 4096 connections
> keepalive 4096;
> }
>
> At high load this results in a number of:
>
> [error] 5134#0: *67436 connect() to unix:/redis-6406/redis.sock failed (11:
> Resource temporarily unavailable) while connecting to upstream, client:
> XXX.XXX.XXX.XXX, server: test.io, request: "GET /test/mode:direct HTTP/1.1",
> subrequest: "/redis_creative_get", upstream:
> "redis2://unix:/redis-6406/redis.sock:", host: "XXX.XXX.XXX.XXX"
>
> This may simply be a natural limit (I'm seeing this at a traffic load of
> around 15K/s), but I was wondering if there is anything I could look at to
> tune this further? (I should note that redis-server sees very low load at
> this point)
>
It seems like your Redis server is just not keeping up with the
traffic. Consider tuning the Redis configuration, especially enlarging
the "backlog" setting, and/or sharding across multiple Redis server
instances.
BTW, you should get better performance with the lua-resty-redis
library instead of issuing subrequests to ngx_redis2 from Lua.
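Roughly like this (just a sketch for a content_by_lua or
access_by_lua context; the socket path is taken from your upstream
block above, while the key name and the pool parameters are made up):

    local redis = require "resty.redis"

    local red = redis:new()
    red:set_timeout(1000)  -- 1 second, in milliseconds

    local ok, err = red:connect("unix:/redis-6406/redis.sock")
    if not ok then
        ngx.log(ngx.ERR, "failed to connect to redis: ", err)
        return ngx.exit(500)
    end

    local res, err = red:get("some_key")
    if not res then
        ngx.log(ngx.ERR, "failed to run GET: ", err)
        return ngx.exit(500)
    end

    -- put the connection into the built-in connection pool:
    -- at most 100 idle connections per worker, 10 sec max idle time
    local ok, err = red:set_keepalive(10000, 100)
    if not ok then
        ngx.log(ngx.ERR, "failed to set keepalive: ", err)
    end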
For high throughput like this, properly setting CPU affinity on both
the Nginx workers and the local backend servers like Redis will boost
performance dramatically in practice.
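For instance, on a 4-core box you could dedicate three cores to the
Nginx workers and leave one core for redis-server. This is only an
illustration; adjust the numbers to your actual hardware:

    worker_processes  3;

    # bind the 3 workers to CPU cores 0, 1 and 2, one core each,
    # leaving core 3 for redis-server (started via "taskset -c 3 ...")
    worker_cpu_affinity 0001 0010 0100;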
Also, if CPU is the bottleneck, then rendering Flame Graphs for the
daemon processes eating the CPU time can give you a lot of clues about
further performance improvements. For
example, the following tools can be used to render Flame Graphs by
sampling your live systems under load (both Nginx and other processes
like Redis) on Linux:
https://github.com/agentzh/nginx-systemtap-toolkit#ngx-sample-bt
https://github.com/agentzh/nginx-systemtap-toolkit#ngx-sample-lua-bt
Best regards,
-agentzh