Redis constant timeout errors


rob dt

Jul 28, 2011, 5:31:32 PM
to Redis DB
Hi list

I run a relatively high-traffic site with a fairly heavy Redis element. 

Setup in brief:

EC2 m1.large running redis-server, plus some Unicorn clients and Resque background worker clients (connecting to Redis)
EC2 m1.large with Unicorn connections and Resque background worker clients (connecting to Redis on the first server)
2 x additional servers running various processes, connecting to Redis on the first server

The Redis database size is currently about 950MB. The client timeout in redis.conf is 0, and VM is not in use.

The problem: I have constant problems with Redis timeouts. Redis seems to lock itself somewhere, causing Unicorn requests to pile up for a few seconds before being processed in one go.

I have had to set a high client-side timeout on the Unicorn processes to avoid the website falling over.

However Resque processes constantly die with "Timeout reading from the socket" exceptions.

I don't feel it should be necessary to have a high client timeout anywhere. A quick MONITOR suggests about 700 Redis requests/second during normal traffic.

Is this very high? Is there something fundamentally wrong with my setup? What could be causing Redis to back up and timeout like this? What are the best practices for running Redis with this level of traffic?

many thanks

Riyad

Jul 28, 2011, 7:52:52 PM
to Redis DB
Rob, during these lockups what are your %wa and %st values in the
instance as reported by top?

If %wa is high (around or above 1%), you are thrashing to disk: you are
likely running out of memory, or *something* is deciding to write to
disk.

If %st is high then other VMs on that same host are hammering the
physical box and temporarily blocking you out of sufficient resources.

Another thought: if you are using an EBS volume backing your instance and
don't have it in a software RAID across a few EBS volumes, my guess is
that if you use iostat you'll see some big bursts of EBS lag
coincide with those times when your instance seems to choke. It could
be Redis writing out a snapshot or an append-only log, where sending
that over the internal network to the EBS disk lags way behind.

I'd just be curious about more operating metrics of the machine when
these pauses happen.
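
Something along these lines would let you watch them continuously and catch a
lockup in the act (iostat needs the sysstat package installed; the intervals
are just a suggestion):

vmstat 5       # the "wa" and "st" columns correspond to top's %wa / %st
iostat -x 5    # per-device stats; await and %util are the interesting columns for EBS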

Rob

Jul 29, 2011, 6:00:14 AM
to redi...@googlegroups.com
Hi Riyad - 
The lockups don't seem to particularly correlate with when the site is very busy traffic-wise. I will watch next time we get a burst of them and report the results.

We are using EBS (on an m1.large) with no RAID, etc.



Alexander Gladysh

Jul 29, 2011, 6:30:50 AM
to redi...@googlegroups.com

On 29 Jul 2011, at 14:00, Rob <robert...@gmail.com> wrote:

Alexander Gladysh

Jul 29, 2011, 6:32:37 AM
to Alexander Gladysh, redi...@googlegroups.com
Sorry, sent accidentally, please disregard. 

On 29 Jul 2011, at 14:30, Alexander Gladysh <agla...@gmail.com> wrote:

Rob

Aug 10, 2011, 5:59:26 AM
to redi...@googlegroups.com
Riyad,

I watched top through a spate of these lockups this morning.

%st is usually less than 10%, sometimes rising to 30%.

%wa, however, is often above 1%, typically hitting 40-60% for several seconds during a lockup.

So I'm pretty sure that disk IO is the culprit here. But is it likely that Redis itself is responsible for this?

If not, is there a way to monitor the disk IO of individual processes so I can figure out what is causing the load?

Didier Spezia

Aug 10, 2011, 8:03:47 AM
to redi...@googlegroups.com

Hi Rob,

I would start by reviewing the persistence options set in the Redis configuration file.
Is bgsave activated? Is AOF activated? With which flushing policy? etc.
I/Os can be generated by read/write/sync operations, but also by mapped memory
and swapping. So I would check whether the instance is swapping or not when
Redis is unresponsive.
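
For reference, these are the directives I mean (the values shown are just the
Redis defaults, not a recommendation):

save 900 1              # BGSAVE if >= 1 change in 900 seconds
save 300 10             # BGSAVE if >= 10 changes in 300 seconds
save 60 10000           # BGSAVE if >= 10000 changes in 60 seconds
appendonly no           # whether the AOF is written at all
appendfsync everysec    # AOF flushing policy: always / everysec / no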

700 q/s is quite low for a Redis instance, even on a VM.

To find which process is sucking up all your I/O bandwidth, you need a kernel
with "per process" I/O accounting activated (a compile-time option, if I'm not
mistaken). If the kernel is not too old, it should be activated by default.
Then you can use a tool like htop to display the IO_RATE, IO_READ_RATE and
IO_WRITE_RATE columns and identify the processes. I hope it will work with EBS,
but I'm not even sure.
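
For example (assuming iotop and/or the sysstat tools are installed on the
instance; both rely on the same per-process accounting):

iotop -o       # only show processes currently doing I/O
pidstat -d 5   # per-process disk read/write rates, every 5 seconds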

Regards,
Didier.

Rob

Aug 11, 2011, 4:34:56 PM
to redi...@googlegroups.com
Hey Didier

Thanks for all the information.

I installed iotop and watched it alongside top during a long spell of poor performance and lengthy timeouts.

It didn't take long to confirm that the problem is triggered by Redis snapshotting itself. 

Our database size is approximately 1GB, and Redis is set to snapshot roughly every 2 minutes during normal load. 

Every time Redis forks to save, the disk IO shown in iotop for Redis reaches 99.99%, and soon afterwards is reflected in top's %wa figure. As soon as the snapshot concludes, disk IO quickly returns to normal levels.
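
In case it is useful, something like this shows when a dump is in progress, so
it can be lined up with the iotop/%wa spikes (field names may differ slightly
between Redis versions):

watch -n 1 'redis-cli info | grep -E "bgsave_in_progress|changes_since_last_save"'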

There are several contributory factors here, I think; I suspect EC2's poor disk IO performance and my Redis settings being too eager to snapshot are the primary causes.

I'm guessing that appendonly mode would help here? Data integrity is not paramount in our system, so would "fsync never" be the most efficient option? How often does the data typically get written out by the OS with this configuration?

thanks.

R




Didier Spezia

Aug 11, 2011, 5:09:00 PM
to redi...@googlegroups.com

appendonly may help if you have a lot of reads and not many writes
on the Redis instance (if you use Resque, I doubt it, though).
"fsync never" is obviously the most efficient option for AOF, but please
note that even in that mode, data is still systematically written (with write)
after each event loop iteration; it is just not sync'd. A write operation
can also block when there is pressure on the filesystem cache.

Also, the AOF mechanism will regularly trigger a job to rewrite and
compact the file. It can also generate a peak of I/Os at that time.
Be sure to set appendfsync and no-appendfsync-on-rewrite to no
if you want to experiment with this configuration.
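
In redis.conf that experiment would look roughly like this (adapt as needed):

appendonly yes                   # enable the AOF
appendfsync no                   # "fsync never": the OS decides when data hits the disk
no-appendfsync-on-rewrite no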

Otherwise you can simply increase the interval of the bgsave
(save parameter) to something more in line with EBS performance.
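
For example (figures purely illustrative, to be tuned to your write rate):

save 900 1        # dump after 15 minutes if there was at least 1 change
save 300 10000    # or after 5 minutes if there were at least 10000 changes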

Regards,
Didier.

Pieter Noordhuis

Aug 12, 2011, 4:47:45 AM
to redi...@googlegroups.com
Hi Rob,

I'm starting to think this may be related to issue #602. Can you be
more specific about when the lockup happens: together with high %io,
together with high %wa, or both? I suspect it only occurs together with
high %wa, when Redis renames the new dump and unlinks the old one. If the
disk performance is poor, this may block Redis a bit longer than it
should, resulting in timeouts. Can you confirm whether my suspicion is
correct?

Thanks,
Pieter

Pieter Noordhuis

Aug 12, 2011, 4:53:03 AM
to redi...@googlegroups.com
You can easily confirm this is indeed the lockup by running the
following strace side by side with your top/iotop/...:

strace -ttt -T -p $(pgrep redis-server) -e"trace=open,close,unlink,rename"

When a background dump or a background AOF rewrite finishes, you will
see something along the lines of:

1313138997.138010 --- SIGCHLD (Child exited) @ 0 (0) ---
1313138997.142032 open("temp-rewriteaof-bg-31724.aof", O_WRONLY|O_APPEND|O_LARGEFILE) = 7 <0.000016>
1313138997.142226 rename("temp-rewriteaof-bg-31724.aof", "appendonly.aof") = 0 <0.000057>
1313138997.142407 close(6) = 0 <0.143789>
1313138997.286558 unlink("temp-rewriteaof-bg-31724.aof") = -1 ENOENT (No such file or directory) <0.000024>

Note that close() takes 140ms here. I suspect close() takes more or
less as long as your Redis instance is locked up. If this is the case, we
know for sure.

Thanks,
Pieter
