Asynchronous AOF fsync is taking too long


Alexander Gladysh

Sep 28, 2012, 3:28:16 PM
to redi...@googlegroups.com
Hi, list!

We use Redis 2.4.14 on Ubuntu 11.04, running inside Xen XCP on a
Hetzner-hosted physical server (more details at the end of this
message).

This particular Redis instance handles about 2K commands per second
(more at peak hours). The commands are distributed as follows:

$ redis-cli monitor | head -n 2000 | awk '{print $4;}' | sort | uniq -c | sort -nr
    837 "HINCRBY"
    713 "PING"
    376 "GET"
     18 "MULTI"
     18 "INCR"
     18 "EXPIREAT"
     18 "EXEC"
      2

(I believe the proportions stay similar regardless of load. The high
number of PINGs is a workaround for a flaw in the persistent-connection
implementation on the client side.)

Each time this Redis instance performs an automatic BGREWRITEAOF, we
see a significant slowdown, and the rewrite runs for a considerable
time.

In Redis log we see this:

[11882] 26 Sep 13:44:42 * Starting automatic rewriting of AOF on 100% growth
[11882] 26 Sep 13:44:43 * Background append only file rewriting started by pid 20591
[11882] 26 Sep 13:44:48 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[11882] 26 Sep 13:44:51 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[20591] 26 Sep 13:45:03 * SYNC append only file rewrite performed
[11882] 26 Sep 13:45:03 * Background AOF rewrite terminated with success
[11882] 26 Sep 13:45:03 * Parent diff successfully flushed to the rewritten AOF (1560565 bytes)
[11882] 26 Sep 13:45:03 * Background AOF rewrite successful

If we drop auto-aof-rewrite-percentage to 50%, we still get similar results:

[20694] 28 Sep 17:33:35 * Starting automatic rewriting of AOF on 50% growth
[20694] 28 Sep 17:33:36 * Background append only file rewriting started by pid 11105
[20694] 28 Sep 17:33:40 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[20694] 28 Sep 17:33:43 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[20694] 28 Sep 17:33:47 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[11105] 28 Sep 17:33:50 * SYNC append only file rewrite performed
[20694] 28 Sep 17:33:51 * Background AOF rewrite terminated with success
[20694] 28 Sep 17:33:51 * Parent diff successfully flushed to the rewritten AOF (723267 bytes)
[20694] 28 Sep 17:33:51 * Background AOF rewrite successful

While the AOF rewrite is running, Redis appears to saturate the disk
write channel at about 50 MB/s. Here is a snapshot from atop:

  PID  RUID    EUID    THR  SYSCPU  USRCPU  VGROW  RGROW  RDDSK   WRDSK  ST  EXC  S  CPUNR  CPU  CMD
11882  redis   redis     3  20.26s   8.63s     0K  -292K     0K  51076K  --    -  S      0   2%  redis-server
12924  zabbix  zabbix    1   0.22s   0.08s     0K     0K     0K      0K  --    -  S      0   0%  zabbix_agentd
  501  root    root      1   0.00s   0.06s     0K     0K     0K      0K  --    -  S      0   0%  irqbalance
  177  root    root      1   0.04s   0.00s     0K     0K     0K    672K  --    -  S      0   0%  jbd2/xvda1-8
  526  root    root      1   0.01s   0.02s     0K     0K     0K    144K  --    -  S      0   0%  xe-daemon
 4057  root    root      1   0.02s   0.00s     0K     0K     4K     12K  --    -  R      0   0%  atop
19591  root    -         0   0.02s   0.00s     0K     0K      -       -  NE    0  E      -   0%  <xe-update-gu>
19664  root    -         0   0.01s   0.01s     0K     0K      -       -  NE    0  E      -   0%  <xe-update-gu>
19780  root    -         0   0.02s   0.00s     0K     0K      -       -  NE    0  E      -   0%  <xe-update-gu>
19810  root    -         0   0.02s   0.00s     0K     0K      -       -  NE    0  E      -   0%  <xe-update-gu>
12930  zabbix  zabbix    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      0   0%  zabbix_agentd
 9228  root    root      1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      0   0%  runsvdir
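
(As a cross-check, one way to confirm that the disk really is saturated
during the rewrite is to watch device utilization with iostat from the
sysstat package; a minimal sketch:

# Extended device statistics, refreshed every second; %util near 100
# together with a growing await means the disk itself is the bottleneck.
$ iostat -x 1
)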

We're working on splitting this Redis instance into several smaller
ones, but that will take time. Is there anything else we can do to ease
the HDD load? Note that we cannot quickly upgrade to 2.6 (too much
testing needed, and it is still marked unstable), but we could move to a
more recent 2.4 release if that would help.

Any hints are appreciated. Please tell me if I can provide additional
diagnostics and/or run some experiments.

With best regards,
Alexander.

$ uname -a
Linux MYHOST 2.6.38-15-virtual #59-Ubuntu SMP Fri Apr 27 16:38:04 UTC
2012 x86_64 x86_64 x86_64 GNU/Linux

$ redis-cli info
redis_version:2.4.14
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.5.2
process_id:20694
uptime_in_seconds:29848
uptime_in_days:0
lru_clock:668190
used_cpu_sys:1364.73
used_cpu_user:648.88
used_cpu_sys_children:7.07
used_cpu_user_children:5.48
connected_clients:34
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:1391666336
used_memory_human:1.30G
used_memory_rss:1364901888
used_memory_peak:1398449032
used_memory_peak_human:1.30G
mem_fragmentation_ratio:0.98
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:1
changes_since_last_save:38263529
bgsave_in_progress:0
last_save_time:1348829333
bgrewriteaof_in_progress:0
total_connections_received:9686
total_commands_processed:67571388
expired_keys:0
evicted_keys:0
keyspace_hits:6039485
keyspace_misses:7485175
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:240648
vm_enabled:0
role:master
aof_current_size:1310274432
aof_base_size:1304818998
aof_pending_rewrite:0
aof_buffer_length:0
aof_pending_bio_fsync:0
db0:keys=1,expires=0
db2:keys=60,expires=0
db3:keys=4717,expires=0
db4:keys=30,expires=10

Felix Gallo

Sep 28, 2012, 3:52:12 PM
to redi...@googlegroups.com
Ah, the cloud.

The 'fastest' way to deal with this is to turn persistence off on your production server and set up a (networked, not on the same hypervisor) slave to handle persistence.  It looks like the IOPS on your disk are relatively poor; are there any other VMs on this box that might be hogging the disk, or hitting it hard enough to confuse the situation?

The next-most-pragmatic solution would be to disable AOF rewrites completely and run them either offline or during off-peak hours.
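
(A minimal sketch of that approach, assuming a stock redis.conf and
redis-cli; the 5 a.m. cron schedule is only an example:

# In redis.conf: a percentage of 0 disables automatic AOF rewrites.
auto-aof-rewrite-percentage 0

# Then trigger the rewrite yourself from cron at a quiet hour:
# m h  dom mon dow  command
0 5 * * * redis-cli -h 127.0.0.1 -p 6379 BGREWRITEAOF
)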

F.




Alexander Gladysh

Sep 28, 2012, 4:00:23 PM
to redi...@googlegroups.com
Hi, Felix,

On Fri, Sep 28, 2012 at 11:52 PM, Felix Gallo <felix...@gmail.com> wrote:
> Ah, the cloud.
>
> The 'fastest' way to deal with this is to turn persistence off on your
> production server, and set up a (networked, not on the same hypervisor)

Well... I assume that any other Hetzner machine will have the same HDD
write performance, so I do not quite follow. See also below.

> slave to handle persistence. It looks like the iops on your disk is
> relatively poor;
> are there any other VMs on this box that might be hogging
> the disk or hitting it hard enough to confuse the situation?

I have to double-check this, but no, they should not be.

I assumed that 50 MB/s is about as much as an HDD can do. Am I too far
behind the times?

> The next-most-pragmatic solution would be to disable aof rewrite completely
> and do it either offline or during non-peak hours.

The problem is that the load is pretty high even during off-peak hours.

Thanks,
Alexander.

Javier Guerra Giraldez

Sep 28, 2012, 4:16:37 PM
to redi...@googlegroups.com
On Fri, Sep 28, 2012 at 3:00 PM, Alexander Gladysh <agla...@gmail.com> wrote:
>> The 'fastest' way to deal with this is to turn persistence off on your
>> production server, and set up a (networked, not on the same hypervisor)
>
> Well... I assume that any other Hetzner machine will have the same HDD
> write performance. So I do not quite follow. See also below.

Maybe not your specific problem, but delegating persistence to a
separate machine prevents the low performance of fork() from affecting
the master. Xen systems in particular are known to stall for seconds
when fork()ing a multi-gigabyte process.
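
(The INFO output earlier in this thread shows the cost directly:
latest_fork_usec:240648, i.e. roughly a quarter of a second for the
last fork. A quick way to keep an eye on it:

$ redis-cli info | grep latest_fork_usec
latest_fork_usec:240648
)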

--
Javier

Alexander Gladysh

Sep 28, 2012, 4:19:04 PM
to redi...@googlegroups.com
Looks like that is not my problem — Redis complains about fsync(), not
fork(), and it still manages to produce 50 MB/s of data... so...

But point taken. However, backing each Redis guest with a physical
server does not seem very cost-effective...

Thanks,
Alexander.

Felix Gallo

Sep 28, 2012, 4:44:07 PM
to redi...@googlegroups.com
The benefits of setting up another machine to handle persistence, even if it has the same disk characteristics:

* a different set of cores is doing the work
* typically the replication buffering between the master and the slave is sufficient that even if the slave has to spend many seconds doing something big and complicated, it can catch up to the master's position rapidly
* the pause on the slave does not impact the master

Obviously there are a lot of variables to worry about and several possible failure modes (e.g., what if rewriting takes up so much time that the slave keeps losing ground? what if the network goes down? ...), but unfortunately the cloud optimizes for ease of setup rather than raw horsepower or simplicity of solution; you must sometimes make up for its shortcomings.
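
(A minimal sketch of the pattern, assuming two stock 2.4 instances; the
hostname is a placeholder:

# redis.conf on the master: no AOF, and comment out every "save" line
# so no RDB snapshots are taken either.
appendonly no
# save 900 1
# save 300 10
# save 60 10000

# redis.conf on the persisting slave:
slaveof master.example.com 6379
appendonly yes
appendfsync everysec
)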

It sounds like the Hetzner disks (and/or your use of them) are very suboptimal; that's a shame.  You might spring for the SSD option, which Hetzner does in fact provide: http://www.hetzner.de/en/hosting/news/ssd-festplatte/.

F.



Felix Gallo

Sep 28, 2012, 4:46:27 PM
to redi...@googlegroups.com
I might add -- I too use the cloud, with the nonpersisting-masters/persisting-slaves pattern (all machines are identical), and it works great for me with 32 gigabytes of live data and 750,000 users.  The AOF rewrite takes many seconds, but the master never notices.

F.

Alexander Gladysh

Sep 28, 2012, 5:08:19 PM
to redi...@googlegroups.com
On Sat, Sep 29, 2012 at 12:44 AM, Felix Gallo <felix...@gmail.com> wrote:
<snip>

Thank you for your advice; we'll think about that. Still, I'm looking
for other ways to tune the existing installation — even if we do move
Redis to a dedicated server.

> It sounds like the Hetzner disks (and/or your use of them) are very
> suboptimal; that's a shame.

Care to elaborate? What write performance would you expect on a pair
of non-SSD HDDs in RAID1?

Thanks,
Alexander.

Felix Gallo

Sep 28, 2012, 5:47:27 PM
to redi...@googlegroups.com
Hetzner appears to use commodity 7200 RPM disks.  Here's a survey of raw write throughput for 7200s in general:

http://www.tomshardware.com/charts/hdd-charts-2012/-04-Write-Throughput-Average-h2benchw-3.16,2904.html

Note that 50 MB/s is a little less than Western Digital's low-budget economy 'green' 5400 RPM product.  Now, you are also paying the RAID-1 price of two writes; could be that's hurting you too.

By comparison, here's an SSD chart from last year:


Note that a single older SSD drive in a non-RAID configuration gets you about 4x your current performance; 5 drives in RAID-0 get you 700 MB/s, which is 14 times your current performance.

F.



Alexander Gladysh

Sep 29, 2012, 12:47:14 AM
to redi...@googlegroups.com
On Sat, Sep 29, 2012 at 1:47 AM, Felix Gallo <felix...@gmail.com> wrote:
> Hetzner appears to use commodity 7200 RPM disks. Here's a survey of raw
> write throughput for 7200s in general:
>
> http://www.tomshardware.com/charts/hdd-charts-2012/-04-Write-Throughput-Average-h2benchw-3.16,2904.html
>
> Note that 50 MB/s is a little less than Western Digital's low budget economy
> 'green' 5400 RPM product. Now, you are paying the RAID-1 price of two
> writes; could be that's hurting you too.

So, on a single disk we should see 75—100 MB/s, right?

Thanks!
Alexander.

Felix Gallo

Sep 29, 2012, 2:42:50 AM
to redi...@googlegroups.com

On real metal, unvirtualized, with tuning, yes.

Matthew Palmer

Sep 29, 2012, 4:37:03 AM
to redi...@googlegroups.com
On Fri, Sep 28, 2012 at 11:42:50PM -0700, Felix Gallo wrote:
> On Sep 28, 2012 9:47 PM, "Alexander Gladysh" <agla...@gmail.com> wrote:
> > On Sat, Sep 29, 2012 at 1:47 AM, Felix Gallo <felix...@gmail.com> wrote:
> > > Hetzner appears to use commodity 7200 RPM disks. Here's a survey of raw
> > > write throughput for 7200s in general:
> > >
> > > http://www.tomshardware.com/charts/hdd-charts-2012/-04-Write-Throughput-Average-h2benchw-3.16,2904.html
> > >
> > > Note that 50 MB/s is a little less than Western Digital's low budget
> > > economy 'green' 5400 RPM product. Now, you are paying the RAID-1
> > > price of two writes; could be that's hurting you too.
> >
> > So, on a single disk we should see 75—100 MB/s, right?
>
> On real metal, unvirtualized, with tuning, yes.

Doing streaming writes. Which writing to a filesystem most assuredly is
not.

- Matt

--
A friend is someone you can call to help you move. A best friend is someone
you can call to help you move a body.

Yiftach Shoolman

Sep 29, 2012, 7:03:25 AM
to Matthew Palmer, redi...@googlegroups.com
You can also tune the auto-aof-rewrite-percentage and
auto-aof-rewrite-min-size options to control how often the rewrite
events occur.
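
(A minimal sketch against a stock 2.4 redis.conf; note also the
no-appendfsync-on-rewrite option, which targets exactly the "fsync is
taking too long" warning at the cost of weaker durability guarantees
while a rewrite is running:

# Rewrite only when the AOF has doubled since the last rewrite AND has
# grown to at least 1 GB, so rewrites fire less often under load.
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 1gb

# Skip fsync of the AOF while a BGSAVE or BGREWRITEAOF is in progress.
no-appendfsync-on-rewrite yes
)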

M. Edward (Ed) Borasky

Sep 29, 2012, 12:13:18 PM
to redi...@googlegroups.com
On Sat, Sep 29, 2012 at 1:37 AM, Matthew Palmer <mpa...@hezmatt.org> wrote:
> On Fri, Sep 28, 2012 at 11:42:50PM -0700, Felix Gallo wrote:
>> On Sep 28, 2012 9:47 PM, "Alexander Gladysh" <agla...@gmail.com> wrote:
>> > On Sat, Sep 29, 2012 at 1:47 AM, Felix Gallo <felix...@gmail.com> wrote:

[snip]

>> > So, on a single disk we should see 75—100 MB/s, right?
>>
>> On real metal, unvirtualized, with tuning, yes.
>
> Doing streaming writes. Which writing to a filesystem most assuredly is
> not.

It depends on which filesystem you use (on Linux), how fragmented it is
with regard to seeks, and how big the buffers and caches are. You can
design something that captures data to a magnetic surface at these
rates, and with blktrace you can measure its effectiveness.
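
(A minimal sketch of such a measurement, assuming a scratch file on the
same filesystem and the xvda device seen earlier in this thread;
oflag=direct bypasses the page cache:

# Raw sequential write throughput without the page cache:
$ dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=1024 oflag=direct

# Record 30 seconds of block-layer events, then inspect them:
$ blktrace -d /dev/xvda -w 30 -o trace
$ blkparse -i trace | less
)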

--
Twitter: http://twitter.com/znmeb; Computational Journalism Publishers
Workbench: http://j.mp/QCsXOr

How the Hell can the lion sleep with all those people singing "A weem
oh way!" at the top of their lungs?