How can connected clients grow?

David Montgomery

Oct 27, 2012, 8:52:31 AM
to redi...@googlegroups.com
Hi,

Currently, my web app is taking 3K qps. I use uWSGI with the gevent
loop and the redis-py client (with monkey patching), so it's not
blocking. I have 1 master and two slaves. Everything is 100% in RAM:
no snapshots, no appends.

My app has a strict deadline of 100ms. Each request makes two reads
and two writes; I do the writes after I finish with the request. I
have a 90 ms (0.09 s) timeout on the redis-py connection, and I
additionally use gevent to kill the connection if it takes too long.

My avg round-trip time, from the moment I receive a POST to the
moment the sender receives the response, is about 20ms. This is a
fast app. The redis config is a basic install of 2.6; the only
modification is turning off saves and append-only.

import gevent
import redis

r = redis.StrictRedis()  # assumption: the client is created once at startup

def RedisGet(profile):
    try:
        return r.get('global')
    except Exception:
        pass  # swallow errors; a missing value is treated like a timeout

jt = gevent.spawn(RedisGet, mab_profile)
jt.join(.09)          # wait at most 90 ms for the read
jt.kill(block=False)  # abandon the greenlet if it is still running


My app will work smoothly for 2-3 hours. The avg number of connected
clients on the master is around 300.

Out of the blue, connected clients can grow, sometimes to 10,000 if
left unchecked, and I have to restart uWSGI. A high number of
connected clients triggers failovers. These connections come from the
web server. This happens sometimes once an hour, sometimes every 2-3
hours.

I am kinda stumped on what the issue is and how to debug it.
Statistically speaking, the traffic is constant. I am using data pipe
cloud on their top-tier machines. I have hardly any keys, and the
machine has 16 GB of RAM and two cores; one core is idle.

I have my code in try/except clauses, with timeouts etc. I don't know
how to debug it at this point.


Any wisdom is much needed.

Josiah Carlson

Oct 27, 2012, 1:10:51 PM
to redi...@googlegroups.com
If Redis starts to lag*, many of your connections will get killed
(though the underlying sockets are not necessarily closed
immediately). Those killed connections are not recycled; new ones get
created instead (as expected). Connection creation costs 1.5 round
trips for the TCP handshake before your app can send any data, so
those connection attempts increase the likelihood of timing out on
all requests. This snowballs into exactly the behavior you are
experiencing.

* Lag can be caused by the Linux scheduler deciding that another
process should get time, or that the process should move to a
different core; by a slave reconnecting (causing a fork and a
temporary delay); by the network being saturated with slave sync
traffic; or by any one of a number of network hiccups (which can be
caused by solar flares), etc.

There are two simple solutions to this problem.
1. Increase your after-request write timeouts from 90 ms to 1 second
(or even 10 seconds) to reduce the chances of momentary lag running
away on you.
2. Defer your writes to a task processor and effectively remove the
timeouts, but measure the latency for reporting/alerting purposes (a
sketch follows below).
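
A minimal sketch of option 2, assuming gevent is already
monkey-patched as in the original post; `write_queue`, `writer_loop`,
and `deferred_write` are illustrative names, not anything from the
poster's app:

import time
import gevent
from gevent.queue import Queue
import redis

r = redis.StrictRedis()  # assumption: same client the app already uses
write_queue = Queue()

def writer_loop():
    # Background greenlet: drain queued writes and perform them with
    # no tight deadline; record latency for alerting instead.
    while True:
        key, value = write_queue.get()
        start = time.time()
        try:
            r.set(key, value)
        except Exception:
            pass  # log/alert in a real deployment
        latency = time.time() - start  # export this metric somewhere

gevent.spawn(writer_loop)

def deferred_write(key, value):
    # Called from the request handler; returns immediately so the
    # request never blocks on Redis for its writes.
    write_queue.put((key, value))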

Regards,
- Josiah

Tampa312

Oct 27, 2012, 9:54:54 PM
to redi...@googlegroups.com
Thanks,

Does it make sense to change the nice priority on a Redis server? As of now I am at the default. Also, what is best practice for the sysctl.conf file? Are there any recommended settings?

Is there a difference between using gevent to kill a connection versus a redis-py timeout, or are they the same?

Thanks

Didier Spezia

Oct 28, 2012, 3:49:27 AM
to redi...@googlegroups.com
>> Does it make sense to change the nice priority on a redis server?  As of now I am default.  

Not really with a stock Redis version, because this priority will also
be inherited by the Redis background threads and processes, and you
don't want the bgsave or AOF rewrite tasks to hog your CPU.
When I want to optimize scheduling, I have to patch Redis so that the
main thread uses the real-time scheduler while the background
threads/processes are kept in the normal time-sharing scheduling class.


>> Also, what is best practice for sysctl.conf file?  Are there any recommended settings? 

AFAIK, no specific recommendation except setting vm.overcommit_memory
to 1. If you use persistence, you may also want to set vm.swappiness
to zero to avoid Redis memory being swapped out.
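
In sysctl.conf terms, the two settings mentioned above would look
like this (only these two keys come from this thread; everything else
should stay at your distribution's defaults):

# /etc/sysctl.conf excerpt
vm.overcommit_memory = 1
vm.swappiness = 0

Run `sysctl -p` to apply them without a reboot.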

>> Is there a difference between using gevent to kill a connection versus a redis-py timeout or are the same.

I would rather rely on a redis-py timeout than on killing the
coroutine. I'm not even sure what happens to the file descriptor of
the connection (whether it is properly removed from the event loop)
when the coroutine is killed.
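
For illustration, a connection-level timeout in redis-py might look
like this (host, port, and the 2-second value are placeholders, and
how the timeout surfaces may vary by redis-py version):

import redis

# socket_timeout (in seconds) makes the client give up on a stalled
# command instead of relying on gevent to kill the greenlet mid-flight.
r = redis.StrictRedis(host='localhost', port=6379, socket_timeout=2.0)

try:
    value = r.get('global')
except redis.exceptions.ConnectionError:
    value = None  # a timed-out read surfaces as a connection error here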

That said, whatever the mechanism, setting aggressive timeouts usually does not
work very well. Linux is not a real time operating system. It cannot guarantee
that all your Redis commands will be executed in less than 100 ms. The average
latency is much lower than 100 ms, but you may (and you will) have outliers even
on a moderately loaded system.

Personally, I never put less than 2 *seconds* for my communication timeouts.
And even with this value, we are sometimes hit with disconnection storms when
the machines become temporarily unresponsive.

Best regards,
Didier.

Josiah Carlson

Oct 28, 2012, 9:41:02 PM
to redi...@googlegroups.com
On Sun, Oct 28, 2012 at 12:49 AM, Didier Spezia <didi...@gmail.com> wrote:
> I would rather rely on a redis-py timeout than killing the coroutine.
> [...]
> Personally, I never put less than 2 *seconds* for my communication timeouts.

Agreed on all 3 points here, and all other points Didier made. If my
writes can occur after the fact, I defer the writes to task queues or
some other deferred-execution mechanism, and usually leave them with
5+ second timeouts.

Regards,
- Josiah

David Montgomery

Oct 28, 2012, 10:14:13 PM
to redi...@googlegroups.com
Thanks for the advice...helped a lot. I did change the nice to -1 and
now have no issues. I also tweaked the nginx and uWSGI nice values. I
am doing 3K qps on POSTs, making two redis reads, and completing
every request in less than 20ms. Wow....that's kinda fast! No bgsaves
or appendonly. My think time by the time I submit a request is on avg
4ms; the 20ms is the round-trip time reported by the sender to me.
So...with at minimum two reads...redis is blazing fast.

For writing after the fact...yeah..agreed..I will write to another
process using zmq to do post-request cleanup and will be more gentle
on the timeouts.
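
A minimal sketch of that zmq hand-off, assuming pyzmq, with a
hypothetical worker listening on a local PUSH/PULL endpoint (the
address and message format are made up for illustration):

import json
import zmq

ctx = zmq.Context()
push_sock = ctx.socket(zmq.PUSH)
push_sock.connect('tcp://127.0.0.1:5555')  # hypothetical cleanup-worker endpoint

def defer_cleanup(cookie_id, payload):
    # Fire-and-forget: the worker process does the Redis writes on its
    # own schedule, with relaxed timeouts, outside the request path.
    push_sock.send(json.dumps({'cookie_id': cookie_id, 'payload': payload}))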

My big issue then is handling the redis reads as gracefully as
possible. Again...I have a 90ms timeout on the redis client but use
gevent to monitor the execution time.

jt = gevent.spawn(execute_get_cookie_meta, cookie_id)
jt.join(.09)            # give the read at most 90 ms
jt.kill(block=False)    # abandon it if it is still running
cookie_meta = jt.value  # None if the read never finished

So...again..I am no expert, but I imagine that is a brutal kill?
Would it be better, even on a timeout, to at least let the request
finish so the connection can be recycled rather than creating a new
one, as mentioned? If so, I need to figure out a graceful method.

So...on each of the redis reads, I calculate the total time elapsed.
Read 1 gets the full 90ms; if it takes 20ms, then I have 70ms left
for read 2, etc.
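
That budget bookkeeping could look something like the following
sketch; execute_get_cookie_meta and cookie_id are the poster's own
names, and the second read is a stand-in:

import time
import gevent

BUDGET = 0.09  # 90 ms total deadline shared by both reads

start = time.time()

# Read 1: may consume up to the whole budget.
jt = gevent.spawn(execute_get_cookie_meta, cookie_id)
jt.join(BUDGET)
jt.kill(block=False)
cookie_meta = jt.value

# Read 2: only gets whatever budget is left over.
remaining = max(0.0, BUDGET - (time.time() - start))
jt2 = gevent.spawn(execute_get_cookie_meta, cookie_id)  # stand-in for the second read
jt2.join(remaining)
jt2.kill(block=False)
second_meta = jt2.value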

Now my connected clients are stable at around 180 on avg. No more
death spirals shooting up to 10,000...thank god...