How to measure fork() time on bgrewriteaof?

Alexander Gladysh

unread,

Oct 14, 2012, 4:45:26 AM10/14/12

to redi...@googlegroups.com

Hi, list!

I would like to measure fork() time on bgrewriteaof to prove if I'm
experience "slow fork() on Xen" problem.

Redis 2.4.14 logs only mention when rewrite is performed, not when it
is started (maybe it can be improved?):

[755] 14 Oct 12:27:57 * Background append only file rewriting started
by pid 21552
[755] 14 Oct 12:28:01 * Asynchronous AOF fsync is taking too long
(disk is busy?). Writing the AOF buffer without waiting for fsync to
complete, this may slow down Redis.
[755] 14 Oct 12:28:03 * Asynchronous AOF fsync is taking too long
(disk is busy?). Writing the AOF buffer without waiting for fsync to
complete, this may slow down Redis.
[755] 14 Oct 12:28:06 * Asynchronous AOF fsync is taking too long
(disk is busy?). Writing the AOF buffer without waiting for fsync to
complete, this may slow down Redis.
[21552] 14 Oct 12:28:07 * SYNC append only file rewrite performed
[755] 14 Oct 12:28:07 * Background AOF rewrite terminated with success
[755] 14 Oct 12:28:07 * Parent diff successfully flushed to the
rewritten AOF (436762 bytes)
[755] 14 Oct 12:28:07 * Background AOF rewrite successful

Is there a better way to measure it than to stare at strace output?

Thanks,
Alexander.

Yiftach Shoolman

unread,

Oct 14, 2012, 5:15:20 AM10/14/12

to redi...@googlegroups.com

AFAIK, look for latest_fork_usec in the Redis info output

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.

--

Yiftach Shoolman
+972-54-7634621

Alexander Gladysh

unread,

Oct 14, 2012, 5:22:11 AM10/14/12

to redi...@googlegroups.com

On Sun, Oct 14, 2012 at 1:15 PM, Yiftach Shoolman
<yiftach....@gmail.com> wrote:

> On Sun, Oct 14, 2012 at 10:45 AM, Alexander Gladysh <agla...@gmail.com>
> wrote:

>> I would like to measure fork() time on bgrewriteaof to prove if I'm
>> experience "slow fork() on Xen" problem.
>>
>> Redis 2.4.14 logs only mention when rewrite is performed, not when it
>> is started (maybe it can be improved?):

> AFAIK, look for latest_fork_usec in the Redis info output

latest_fork_usec:119206

That is 0.0001s, right? This means that fork is not a problem in my setup...

Alexander.

Salvatore Sanfilippo

unread,

Oct 14, 2012, 5:25:32 AM10/14/12

to redi...@googlegroups.com

On Sun, Oct 14, 2012 at 11:22 AM, Alexander Gladysh <agla...@gmail.com> wrote:

> That is 0.0001s, right? This means that fork is not a problem in my setup...

No it is 0.1 seconds, so 100 milliseconds, still not a big deal
depending on your latency requirements.

Cheers,
Salvatore

> Alexander.
>
> --
> You received this message because you are subscribed to the Google Groups "Redis DB" group.
> To post to this group, send email to redi...@googlegroups.com.
> To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.
>

--

Salvatore 'antirez' Sanfilippo
open source developer - VMware
http://invece.org

Beauty is more important in computing than anywhere else in technology
because software is so complicated. Beauty is the ultimate defence
against complexity.
— David Gelernter

Alexander Gladysh

unread,

Oct 14, 2012, 5:39:56 AM10/14/12

to redi...@googlegroups.com

On Sun, Oct 14, 2012 at 1:25 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> On Sun, Oct 14, 2012 at 11:22 AM, Alexander Gladysh <agla...@gmail.com> wrote:
>
>> That is 0.0001s, right? This means that fork is not a problem in my setup...
>
> No it is 0.1 seconds, so 100 milliseconds, still not a big deal
> depending on your latency requirements.

Well, that's more or less bearable. So, 10 seconds for BGREWRITEAOF
here is probably HDD after all... (weird, dataset is about a
gigabyte...)

[755] 14 Oct 12:27:57 * Background append only file rewriting started
by pid 21552
[755] 14 Oct 12:28:01 * Asynchronous AOF fsync is taking too long
(disk is busy?). Writing the AOF buffer without waiting for fsync to
complete, this may slow down Redis.
[755] 14 Oct 12:28:03 * Asynchronous AOF fsync is taking too long
(disk is busy?). Writing the AOF buffer without waiting for fsync to
complete, this may slow down Redis.
[755] 14 Oct 12:28:06 * Asynchronous AOF fsync is taking too long
(disk is busy?). Writing the AOF buffer without waiting for fsync to
complete, this may slow down Redis.
[21552] 14 Oct 12:28:07 * SYNC append only file rewrite performed
[755] 14 Oct 12:28:07 * Background AOF rewrite terminated with success
[755] 14 Oct 12:28:07 * Parent diff successfully flushed to the
rewritten AOF (436762 bytes)
[755] 14 Oct 12:28:07 * Background AOF rewrite successful

Alexander.

Alexander Gladysh

unread,

Oct 14, 2012, 6:22:34 AM10/14/12

to redi...@googlegroups.com

BTW, is there a way to conclusively *prove* that this is due to HDD? I
see consistently 7-10 seconds per rewrite.

I already asked something along these lines recently here, but there
were no *concrete* answers (or I missed them). Sorry for lazyish
question, answers "read my post [link] again and do your homework" are
accepted (but if someone can give specifics, it will be much
appreciated).

Thanks,
Alexander.

Yiftach Shoolman

unread,

Oct 14, 2012, 7:05:37 AM10/14/12

to redi...@googlegroups.com

iostat -x <n seconds> and look at :

avgqu-sz - should be growing during rewrite time
await/w_await - should be growing during the rewrite time

Alexander.

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.

--

Yiftach Shoolman
+972-54-7634621

M. Edward (Ed) Borasky

unread,

Oct 14, 2012, 11:44:48 AM10/14/12

to redi...@googlegroups.com

You might need to install the 'sysstat'package. IIRC it is not installed by default.

Twitter: http://twitter.com/znmeb; Computational Journalism Publishers Workbench: http://znmeb.github.com/Computational-Journalism-Publishers-Workbench/

How the Hell can the lion sleep with all those people singing "A weem oh way!" at the top of their lungs?

Josiah Carlson

unread,

Oct 15, 2012, 7:46:38 PM10/15/12

to redi...@googlegroups.com

How big is your AOF after rewrite? If it's about 1 gig, then you're
getting 100 megs/second write to disk. Unless you are using SSDs, RAID
0 on multiple disks, etc., 100 megs/second is not unreasonable.

- Josiah

Alexander Gladysh

unread,

Oct 16, 2012, 2:51:28 AM10/16/12

to redi...@googlegroups.com

On Sun, Oct 14, 2012 at 3:05 PM, Yiftach Shoolman
<yiftach....@gmail.com> wrote:
>> BTW, is there a way to conclusively *prove* that this is due to HDD? I
>> see consistently 7-10 seconds per rewrite.

> iostat -x <n seconds> and look at :
>
> avgqu-sz - should be growing during rewrite time
> await/w_await - should be growing during the rewrite time

Thank you, will try it!

Alexander.

Alexander Gladysh

unread,

Oct 16, 2012, 3:07:09 AM10/16/12

to redi...@googlegroups.com

> How big is your AOF after rewrite? If it's about 1 gig, then you're
> getting 100 megs/second write to disk. Unless you are using SSDs, RAID
> 0 on multiple disks, etc., 100 megs/second is not unreasonable.

I see about 50MB/s, and my file is about 500 MB. Yep, my HDD is slow,
but even if it was 100MB/s — it is still not scalable.

Thank you for a good point, don't know how I missed the obvious.

But this confuses me even more.

I'm researching this whole problem because we see significant
performance degradation in our HTTP service that happens when Redis is
doing AOF rewrite.

Redis is under about 2K queries/s load. Commands are distributed more
or less as follows (ignore high number of PINGs, that's a workaround
for a bad persistent connection implementation):

797 "HINCRBY"
630 "PING"
525 "GET"
10 "MULTI"
10 "INCR"
10 "EXPIREAT"
10 "EXEC"
4 "SADD"
2 "SISMEMBER"

To deal with the problem we're planning to (a) split Redis to several
processes (each will handle about third of requests) and/or (b) move
persistence duty to a dedicated slave in a separate VM and disable AOF
at master.

But given this new insight... looks like option (a) wouldn't work —
we'd get three second slowdowns three times more often than ten second
slowdowns we have now, and it is almost just as bad for us.

This is quite unsettling.

Maybe I'm missing something? Why does Redis slows down so much while
it rewrites AOF? It does not sound right.

Please tell me if I can run any experiments to get more detailed view
on what is going on. I'd like to get a full understanding of a problem
instead of hiding behind workarounds.

Thanks,
Alexander.

Yiftach Shoolman

unread,

Oct 16, 2012, 3:39:23 AM10/16/12

to redi...@googlegroups.com

If your disk is slow during AOF re-write operation (again, you can prove that by looking at your iostat avgqu-sz and await/w_await params and see that they are growing), then there is a chance that your Redis hasn't completed its fsync operations within 2 secs and the entire Redis operation is blocked until fsync is completed.

See more details here

Thanks,
Alexander.

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.

--

Yiftach Shoolman
+972-54-7634621

Alexander Gladysh

unread,

Oct 16, 2012, 3:54:15 AM10/16/12

to redi...@googlegroups.com

> If your disk is slow during AOF re-write operation (again, you can prove
> that by looking at your iostat avgqu-sz and await/w_await params and see
> that they are growing),

Will look at that ASAP and report back. But — if my disk write speed
is 50 MB/s and my AOF file size (when rewritten) is 500 MB, then I've
got 10 seconds write no matter what. Which is consistent with what I
see in logs. (See also below.)

> then there is a chance that your Redis hasn't
> completed its fsync operations within 2 secs and the entire Redis operation
> is blocked until fsync is completed.

> See more details here

Here is what I found:

> You may wonder what happens to data that is written to the server while the rewrite is in progress. This new data is simply also written to the old (current) AOF
> file, and at the same time queued into an in-memory buffer, so that when the new AOF is ready we can write this missing part inside it, and finally replace the
> old AOF file with the new one.

Looks like I have to watch for the memory as well. But Redis has
plenty more available than AOF file would grow in that 10 seconds.

$ free -m
total used free shared buffers cached
Mem: 4012 2518 1493 0 141 1391
-/+ buffers/cache: 985 3026
Swap: 5362 0 5362

$ du -s /var/lib/redis/appendonly.aof; sleep 10; du -s
/var/lib/redis/appendonly.aof
827152 /var/lib/redis/appendonly.aof
827488 /var/lib/redis/appendonly.aof

But, I assume that you're pointing me at this quote:

> However if the disk can't cope with the write speed, and the background fsync(2) call is taking longer than 1 second,
> Redis may delay the write up to an additional second (in order to avoid that the write will block the main thread
> because of an fsync(2) running in the background thread against the same file descriptor). If a total of two seconds
> elapsed without that fsync(2) was able to terminate, Redis finally performs a (likely blocking) write(2) to transfer data
> to the disk at any cost.

So, the theory is that background fork doing AOF consumes all disk
write bandwidth, and parent Redis process simply can't flush its data.
Did I get that right?

OK, will experiment with iostat now, so we'd have some hard data to discuss.

Thanks,
Alexander.

Yiftach Shoolman

unread,

Oct 16, 2012, 4:29:18 AM10/16/12

to redi...@googlegroups.com

BTW - what is the value your no-appendfsync-on-rewrite ? (use config get *)

Thanks,
Alexander.

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.

--

Yiftach Shoolman
+972-54-7634621

Alexander Gladysh

unread,

Oct 16, 2012, 4:30:40 AM10/16/12

to redi...@googlegroups.com

On Tue, Oct 16, 2012 at 12:29 PM, Yiftach Shoolman
<yiftach....@gmail.com> wrote:
> BTW - what is the value your no-appendfsync-on-rewrite ? (use config get *)

$ redis-cli config get no-appendfsync-on-rewrite
1) "no-appendfsync-on-rewrite"
2) "no"

Alexander.

Alexander Gladysh

unread,

Oct 16, 2012, 4:38:05 AM10/16/12

to redi...@googlegroups.com

Found this option in redis.conf and read the accompanying comment. So,
looks like, my problem is known.

It keeps going on about "default Linux configuration". Maybe we can
tune something in the system to mitigate this?

Thanks,
Alexander.

Alexander Gladysh

unread,

Oct 16, 2012, 4:39:33 AM10/16/12

to redi...@googlegroups.com

P.S. Maybe http://redis.io/topics/latency should mention this option?

Alexander Gladysh

unread,

Oct 16, 2012, 6:48:18 AM10/16/12

to redi...@googlegroups.com

On Sun, Oct 14, 2012 at 3:05 PM, Yiftach Shoolman <yiftach....@gmail.com> wrote:
>> BTW, is there a way to conclusively *prove* that this is due to HDD? I
>> see consistently 7-10 seconds per rewrite.

> iostat -x <n seconds> and look at :
>
> avgqu-sz - should be growing during rewrite time
> await/w_await - should be growing during the rewrite time

Here we go.

I did two rewrites:

$ date && redis-cli bgrewriteaof
Tue Oct 16 14:00:06 MSK 2012

Background append only file rewriting started

$ date && redis-cli bgrewriteaof
Tue Oct 16 14:00:24 MSK 2012

Background append only file rewriting started

Here is Redis log:

[21686] 16 Oct 14:00:06 * Background append only file rewriting started by pid 8466
[21686] 16 Oct 14:00:10 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[8466] 16 Oct 14:00:18 * SYNC append only file rewrite performed
[21686] 16 Oct 14:00:18 * Background AOF rewrite terminated with success
[21686] 16 Oct 14:00:18 * Parent diff successfully flushed to the rewritten AOF (195076 bytes)
[21686] 16 Oct 14:00:18 * Background AOF rewrite successful

[21686] 16 Oct 14:00:25 * Background append only file rewriting started by pid 8571
[21686] 16 Oct 14:00:29 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[8571] 16 Oct 14:00:36 * SYNC append only file rewrite performed
[21686] 16 Oct 14:00:36 * Background AOF rewrite terminated with success
[21686] 16 Oct 14:00:36 * Parent diff successfully flushed to the rewritten AOF (274855 bytes)
[21686] 16 Oct 14:00:36 * Background AOF rewrite successful

First rewrite: 14:00:06–18

Second one: 14:00:25–36

Here is number of HTTP errors per second for that period (all are 502; good indication of slowdowns):

$ grep 14:00: MY.error.log | awk '{print $2}' | sort | uniq -c
74 14:00:11
121 14:00:12
125 14:00:13
120 14:00:14
131 14:00:15
123 14:00:16
93 14:00:17
6 14:00:18
62 14:00:30
110 14:00:31
114 14:00:32
103 14:00:33
90 14:00:34
44 14:00:35

Slowdown times as seen by Redis clients: 14:00:11–18 and 14:00:30–35.

Dstat measurements on VM with Redis (wow, my HDD is faster than I thought!):

Idle:


----system---- ----total-cpu-usage---- -dsk/total- -net/total- ------sockets------
     time     |usr sys idl wai hiq siq| read  writ| recv  send|tot tcp udp raw frg
16-10 13:59:17|  0   1  99   1   0   0|   0   176k| 109k   53k|198  38   3   0   0
16-10 13:59:18|  0   1  98   0   0   1|   0   136k| 131k   65k|198  38   3   0   0
16-10 13:59:19|  0   0  97   2   0   0|   0   152k| 109k   55k|198  38   3   0   0
16-10 13:59:20|  0   0  99   0   0   0|   0   128k| 124k   63k|198  38   3   0   0
16-10 13:59:21|  0   0 100   0   0   0|   0   144k| 120k   59k|198  38   3   0   0
16-10 13:59:22|  0   0  99   0   0   0|   0   144k| 108k   52k|198  38   3   0   0
16-10 13:59:23|  0   1  99   0   0   0|   0   136k| 116k   57k|198  38   3   0   0
16-10 13:59:24|  0   1  99   0   0   0|   0   144k| 114k   56k|198  38   3   0   0
16-10 13:59:25|  0   1  98   1   0   0|   0   144k| 128k   64k|198  38   3   0   0
16-10 13:59:26|  1   1  98   0   0   0|   0   144k| 108k   54k|198  38   3   0   0

First rewrite:


16-10 14:00:05|  0   0 100   0   0   0|   0   144k| 123k   63k|203  39   3   0   0
16-10 14:00:06|  0   0 100   0   0   0|   0   144k| 117k   60k|203  39   3   0   0
16-10 14:00:07| 20  13  66   0   0   0|   0   392k| 129k   67k|204  40   3   0   0
16-10 14:00:08| 23  10  45  22   0   0|8192B   59M| 138k   68k|204  40   3   0   0
16-10 14:00:09| 22  12   2  63   0   1|   0   159M| 115k   58k|204  40   3   0   0
16-10 14:00:10|  4   3   9  84   0   0|8192B  116M| 115k   61k|204  40   3   0   0
16-10 14:00:11|  0   1   4  95   0   0|   0   124M|  10k 6866B|204  40   3   0   0
16-10 14:00:12|  0   1   2  96   0   0|8192B  143M|2253B 3096B|204  40   3   0   0
16-10 14:00:13|  0   3   0  97   0   0|   0   125M|2142B 2744B|204  40   3   0   0
16-10 14:00:14|  0   2   4  94   0   0|8192B  132M|2271B 2916B|205  42   3   0   0
16-10 14:00:15|  0   1   7  92   0   0|   0   136M|1999B 2698B|205  42   3   0   0
16-10 14:00:16|  0   2   3  95   0   0|8192B   84M|2284B 3032B|205  42   3   0   0
16-10 14:00:17|  0   1   2  97   0   0|   0   151M|2288B 3049B|205  42   3   0   0
16-10 14:00:18|  0   1   4  95   0   0|8192B  133M|2341B 3090B|205  42   3   0   0
16-10 14:00:19|  0   7  89   3   0   0|   0  2152k| 254k  147k|203  39   3   0   0

Second rewrite:


16-10 14:00:24|  0   0  97   2   0   1|   0   184k| 133k   69k|203  39   3   0   0
16-10 14:00:25|  0   3  95   2   0   1|   0   184k| 100k   51k|205  41   3   0   0
16-10 14:00:26| 25  10  64   0   0   0|   0   200k| 144k   75k|204  40   3   0   0
16-10 14:00:27| 25  11  40  24   0   0|   0    98M| 104k   54k|204  40   3   0   0
16-10 14:00:28| 18  12  16  54   0   0|8192B  152M| 131k   69k|206  42   3   0   0
16-10 14:00:29|  4   4   4  88   0   1|   0   136M| 125k   61k|206  42   3   0   0
16-10 14:00:30|  0   0   0 100   0   0|8192B  130M|6282B 5352B|206  42   3   0   0
16-10 14:00:31|  0   0   0 100   0   0|   0   141M|2002B 2682B|206  42   3   0   0
16-10 14:00:32|  0   1   7  92   0   0|8192B  101M|2360B 3370B|206  42   3   0   0
16-10 14:00:33|  0   0  51  49   0   0|   0   145M|2431B 3149B|206  42   3   0   0
16-10 14:00:34|  0   1  51  48   0   0|8192B  123M|2285B 3048B|206  42   3   0   0
16-10 14:00:35|  0   1  51  48   0   0|   0   136M|1999B 2692B|206  42   3   0   0
16-10 14:00:36|  0   2  20  78   0   0|8192B  140M| 193k  112k|204  40   3   0   0
16-10 14:00:37|  0   3  15  81   0   0|   0    63M| 112k   56k|203  39   3   0   0

As we can see, slowdown begins two seconds after HDD gets saturated on write.

Iostat output is rather large, so I put it here: https://gist.github.com/81642c1f641a8c59961d

First rewrite:

avgqu-sz: jumps to 129.34 from zero at :08, peaks with 190.37 at :13, drops to 142.97 at :17 (and then back to zero).

w_await: jumps to 70.2 from zero at :08, peaks with 157.58 at :13 (fluctuating back and forth in the middle), drops to 80.83 at :17 (and then back to zero)

Second rewrite is more or less similar.

Thanks,

Alexander.

Alexander Gladysh

unread,

Oct 24, 2012, 4:29:45 AM10/24/12

to redi...@googlegroups.com

What is interesting is that we have two physical machines with similar
setup and similar load (in different Hetzner DCs). On one machine AOF
rewrite slowdowns give an order of magnitude higher rate of 502s.
We're looking for the difference, but so far were not able to find
anything significant...

Any advice on what to look for?

Thanks,
Alexander.

CharSyam

unread,

Oct 24, 2012, 4:34:29 AM10/24/12

to redi...@googlegroups.com

In my rule of thumb, The total memory size in your machine can affect.

2012/10/24 Alexander Gladysh <agla...@gmail.com>

Alexander Gladysh

unread,

Oct 24, 2012, 4:49:03 AM10/24/12

to redi...@googlegroups.com

On Wed, Oct 24, 2012 at 12:34 PM, CharSyam <char...@gmail.com> wrote:
>
> 2012/10/24 Alexander Gladysh <agla...@gmail.com>
>>
>> What is interesting is that we have two physical machines with similar
>> setup and similar load (in different Hetzner DCs). On one machine AOF
>> rewrite slowdowns give an order of magnitude higher rate of 502s.
>> We're looking for the difference, but so far were not able to find
>> anything significant...
>>
>> Any advice on what to look for?

> In my rule of thumb, The total memory size in your machine can affect.

Thank you, but the memory and VM layout in general is identical as far
as we can tell.

I suspect either a hardware problem, or some obscure Xen or VM kernel
config option that got diverged somehow... (unfortunately, these
machines were set up by manually, so subtle differences are possible).

Alexander.

Reply all

Reply to author

Forward