Socket timeout?

Zoltán Nagy

unread,

Jun 10, 2011, 9:55:10 AM6/10/11

to redi...@googlegroups.com

Hello,

sorry about the subject, but I really don't know anything more specific about the cause of the problem. I'd like to ask for your help on debugging a strange issue.

We're using Predis as the client library, so of course the first step was to eliminate Predis as the culprit. Here is the issue in the Predis issue tracker: https://github.com/nrk/predis/issues/32

There's a pretty detailed description of the problem and investigation there, I'll try to sum up the results:

At seemingly random intervals we get exceptions for all connections to a Redis instance for a second or so. Some debugging points to the connection timing out.
The client works with a socket timeout of 30, and a read-write timeout of 60 seconds.
There are very few connections to the instance at any one time (20-30 at most)
The issue exists both with calls to a local or a remote instance
The errors don't seem to happen at the time of some specific resource intensive process

I don't even know where to start looking. I'd really appreciate any ideas and requests for more (specific) information.

Bret A. Barker

unread,

Jun 10, 2011, 11:25:58 AM6/10/11

to redi...@googlegroups.com

You aren't by chance doing any "keys" commands are you? We ran into a similar issue when we had some code inadvertently calling keys periodically and with millions of keys in our data set that was blocking long enough to cause timeouts for other clients.

-bret

On Fri, Jun 10, 2011 at 06:55:10AM -0700, Zolt??n Nagy wrote:
> Hello,
>
> sorry about the subject, but I really don't know anything more specific
> about the cause of the problem. I'd like to ask for your help on debugging a
> strange issue.
>
> We're using Predis as the client library, so of course the first step was to
> eliminate Predis as the culprit. Here is the issue in the Predis issue
> tracker: https://github.com/nrk/predis/issues/32
>
> There's a pretty detailed description of the problem and investigation
> there, I'll try to sum up the results:
>

> - At seemingly random intervals we get exceptions for all connections to

> a Redis instance for a second or so. Some debugging points to the connection
> timing out.

> - The client works with a socket timeout of 30, and a read-write timeout
> of 60 seconds.
> - There are very few connections to the instance at any one time (20-30
> at most)
> - The issue exists both with calls to a local or a remote instance
> - The errors don't seem to happen at the time of some specific resource

Zoltán Nagy

unread,

Jun 10, 2011, 11:29:11 AM6/10/11

to redi...@googlegroups.com

Actually, we are. (They're in legacy-waiting-for-rewrite status). This will take some time to verify, but it feels right.

Thank you!

Salvatore Sanfilippo

unread,

Jun 10, 2011, 11:45:48 AM6/10/11

to redi...@googlegroups.com

On Fri, Jun 10, 2011 at 3:55 PM, Zoltán Nagy <abe...@gmail.com> wrote:
> I don't even know where to start looking. I'd really appreciate any ideas
> and requests for more (specific) information.

Hello, this is the only things that I can think of:

1) It is very very unlikely that this is a Redis problem. Nothing is
impossible but such a huge bug in a production release not noticed
elsewhere is strange.
2) Do you have some kind of middleware between Redis and your client?
Proxies, SSL, something like that? I guess not but just to check.
3) What happens in the Redis logs when the connection gets closed?
4) Is it possible to simulate that in the same environment with just a
single client doing operations into a loop? That would make debugging
possibly simpler.
5) Upgrading to latest 2.2 is a good idea, even if there are no
strictly related bugs, but there are bugs in your version, so who
knows?
6) Another interesting try could be to upgrade to latest 2.4 branch
commit. Since there you have a CLIENT command than ca be used for
further debugging what happens.
7) You should try compiling 'redis-tools' (see
https://github.com/antirez/redis-tools) and use the following command
to check if the server somewhat stops working when you see the issue:

./redis-stat latency delay 100

8) If you can reproduce that in the loop, try switching in turn:
version of PHP, version of Predis, try with an entirely different
client library, to see if this resolves the issue.
9) Should be completely unrelated but also try to use 'timeout 0' in
your redis.conf.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Salvatore Sanfilippo

unread,

Jun 10, 2011, 11:47:15 AM6/10/11

to redi...@googlegroups.com

That would be evident in unstable using simply "INFO all". Now we log
commands and average execution times internally, and this info is
available using the INFO all command.

Cheers,
Salvatore

On Fri, Jun 10, 2011 at 5:25 PM, Bret A. Barker <br...@iwin.com> wrote:
> You aren't by chance doing any "keys" commands are you? We ran into a similar issue when we had some code inadvertently calling keys periodically and with millions of keys in our data set that was blocking long enough to cause timeouts for other clients.
>

--

Zoltán Nagy

unread,

Jun 10, 2011, 12:15:08 PM6/10/11

to redi...@googlegroups.com

On Friday, June 10, 2011 5:45:48 PM UTC+2, Salvatore Sanfilippo wrote:

On Fri, Jun 10, 2011 at 3:55 PM, Zoltán Nagy <abe...@gmail.com> wrote:
> I don't even know where to start looking. I'd really appreciate any ideas
> and requests for more (specific) information.
Hello, this is the only things that I can think of:
1) It is very very unlikely that this is a Redis problem. Nothing is
impossible but such a huge bug in a production release not noticed
elsewhere is strange.

True, that's one of the reasons why the KEYS idea looks promising.

2) Do you have some kind of middleware between Redis and your client?
Proxies, SSL, something like that? I guess not but just to check.

Nope, nothing like that. Actually we just started using unix sockets where applicable (didn't make the problem go away)

3) What happens in the Redis logs when the connection gets closed?

Checking this should have been an obvious idea. That makes me feel dumb. I checked three error occurrences, and there was a SYNC append only file rewrite in exactly the interval when the errors occurred. That makes me feel even dumber. This is another lead I'll check out, it's even more promising than the KEYS command.

4) Is it possible to simulate that in the same environment with just a
single client doing operations into a loop? That would make debugging
possibly simpler.

My attempts to reproduce the issue in this way failed, but either the KEYS or the append only file rewrite scenario can account for that (bad timing)

5) Upgrading to latest 2.2 is a good idea, even if there are no
strictly related bugs, but there are bugs in your version, so who
knows?

Will do.

6) Another interesting try could be to upgrade to latest 2.4 branch
commit. Since there you have a CLIENT command than ca be used for
further debugging what happens.

I don't think that's possible: the issue is only present on the production system, and of course I don't want to deploy a bleeding-edge version there.

I'll try the rest of your suggestions if the KEYS and the rewrite turn out to be false leads.

Thank you very much :)

Bret A. Barker

unread,

Jun 10, 2011, 1:38:35 PM6/10/11

to redi...@googlegroups.com

Is that going to be in the 2.4 release?

-bret

Salvatore Sanfilippo

unread,

Jun 10, 2011, 4:49:40 PM6/10/11

to redi...@googlegroups.com

On Fri, Jun 10, 2011 at 7:38 PM, Bret A. Barker <br...@iwin.com> wrote:
> Is that going to be in the 2.4 release?

No sorry... that's into unstable land, not easy to backport without
moving too much code around.

Cheers,
Salvatore

Zoltán Nagy

unread,

Jun 14, 2011, 7:08:44 AM6/14/11

to redi...@googlegroups.com

I set the timeout to 0 via CONFIG SET timeout 0, and logged the latency with redis-tools. We have the error even so, and the latency didn't rise above 2ms at all, and was mostly between 0.1-0.5ms.

Next up is the upgrade to 2.2.9.

Background saving and AOF rewrite can't have this effect in theory, am I right?

I haven't yet finished analyzing the log with all the KEYS calls. So the options are

* the upgrade will solve the problem

* there are KEYS calls that block the Redis instance; if so, we need to redesign these

* try to reproduce the error with another client library ( = another language, or system calls to redis-cli)