appendonly + bgsave is locking up the redis-server process... thoughts?


Ty

Jul 1, 2015, 10:26:32 AM
to redi...@googlegroups.com
We have Redis 3.0.2 running on an r3.xlarge EC2 instance, currently using 16.4 GB of memory. We use both AOF and RDB persistence. The drive setup is a ZFS pool consisting of an 80 GB magnetic EBS volume for storage, with the instance's ephemeral SSD used as ZIL and L2ARC to improve throughput, cache reads, and provide a write log so synchronous writes complete faster.
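
(For context, a pool like this can be created along roughly these lines; the device names match the zpool output below, and carving the ephemeral SSD into a log partition and a cache partition is shown only as an illustration:)

# pool backed by the magnetic EBS volume
zpool create storage sdb
# ephemeral SSD partitions as ZIL (log) and L2ARC (cache)
zpool add storage log sdc1
zpool add storage cache sdc2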

In this setup, when a BGSAVE is initiated, things are OK for a bit, but then it appears the AOF can't be appended to quickly enough and Redis blocks, only serving requests in spurts. If I run 'config set appendonly no', Redis runs fine for the duration of the BGSAVE.
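
Concretely, the workaround amounts to something like this from redis-cli (worth noting that re-enabling AOF kicks off a background rewrite of the whole AOF, which is itself I/O-heavy):

redis-cli config set appendonly no     # stop appending to the AOF
redis-cli bgsave                       # fork and write the RDB snapshot
redis-cli config set appendonly yes    # re-enable AOF (triggers a background AOF rewrite)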

I thought the ZFS ZIL would solve this problem by committing the synchronous writes while the EBS volume was busy with the BGSAVE, but apparently no dice. Here's our zpool (sdb is the EBS volume, sdc is the ephemeral SSD):

[root@ip-10-171-132-86 redis]# zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  79.5G  12.9G  66.6G         -    38%    16%  1.00x  ONLINE  -

[root@ip-10-171-132-86 redis]# zpool status storage
  pool: storage
 state: ONLINE
  scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
storage     ONLINE       0     0     0
  sdb       ONLINE       0     0     0
logs
  sdc1      ONLINE       0     0     0
cache
  sdc2      ONLINE       0     0     0

errors: No known data errors


And zpool iostat during a BGSAVE with AOF enabled:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     14.9G  64.6G      0    115      0  14.4M
  sdb       14.9G  64.6G      0    114      0  14.4M
logs            -      -      -      -      -      -
  sdc1      4.50M  9.93G      0      0      0  69.3K
cache           -      -      -      -      -      -
  sdc2      63.2G  1.78G      0    280      0  34.8M
----------  -----  -----  -----  -----  -----  -----


What else can we do here to support both AOF and RDB persistence? I'd like to keep both: RDB because we send it hourly to S3 as our off-server backup, and AOF because, as long as the EBS volume survives whatever calamity takes the server out, we can come back up with no data loss.
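
(The hourly S3 upload is nothing fancy; roughly the following, run from cron, with the bucket and paths as placeholders:)

# hourly cron job; bucket name and RDB path are placeholders
aws s3 cp /var/lib/redis/dump.rdb s3://my-redis-backups/$(hostname)/dump-$(date +%Y%m%d%H).rdb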


FWIW, we used to have the same setup without the ZFS filesystem, just using an SSD EBS volume. It worked fine for a while, but eventually our I/O exceeded the IOPS allotted to the volume, and things got very bad when it throttled down to the 'baseline' rate. We'd prefer not to pay for a provisioned IOPS volume if possible, since that gets very expensive very quickly. Even then, it would still give us higher latency while a BGSAVE was happening, presumably because of disk contention with the AOF.

Thanks,
-Ty

Ty

Jul 1, 2015, 10:36:18 AM
to redi...@googlegroups.com
some further info...

I tested with `appendfsync no`, which helped, but there was still a period of about 10 seconds during the BGSAVE (it appeared to be around the middle of the snapshot's creation; 2.2 GB of the eventual 4 GB had been written to disk according to ls -lah) during which requests went unanswered.
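
For reference, that test was just the following; Redis also has a related no-appendfsync-on-rewrite option (not tried here) that skips AOF fsyncs while a background save or AOF rewrite is in progress:

redis-cli config set appendfsync no                 # let the OS decide when to flush the AOF
redis-cli config set no-appendfsync-on-rewrite yes  # not tried here: skip fsyncs during BGSAVE/BGREWRITEAOF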

Josiah Carlson

Jul 1, 2015, 3:08:59 PM
to redi...@googlegroups.com
This is what is going on:
1. When a BGSAVE starts, a child process is forked from the master
2. This child process then serializes the data into chunks and writes it to disk (EBS in your case)
3. Upon finishing serializing all of the data, Redis fsyncs the newly written snapshot
4. After fsyncing the snapshot, Redis renames the snapshot to whatever the name is in the config
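
(You can confirm when the child is active from the shell; the field names come from INFO persistence, and the grep pattern is just an example:)

redis-cli info persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_time_sec|aof_rewrite_in_progress'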

The problem:
* EBS is not a disk; it is a service that provides block-level storage (Elastic Block Store) on the other side of a network connection
* While #2 is going on, regular commands compete with the snapshot writes
* When #3 happens, the OS prioritizes pushing data to EBS at the expense of regular network traffic, preventing regular commands from getting through

The solution:
* Don't use EBS to create or store your snapshots - write them to a local SSD or spinning disk
* If you need snapshots persisted somewhere else, add slaves and/or add something that automatically backs up updated files (inotify + s3 backup works pretty well)
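
A minimal sketch of that inotify + S3 idea (assumes inotify-tools and the AWS CLI are installed; the watched directory and bucket are placeholders):

# copy dump.rdb to S3 whenever Redis renames a finished snapshot into place
inotifywait -m -e moved_to,close_write --format '%f' /var/lib/redis |
while read -r file; do
    [ "$file" = "dump.rdb" ] && aws s3 cp /var/lib/redis/dump.rdb s3://my-redis-backups/dump.rdb
done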

Note: AOF will have the same problem whenever it grows too big and gets rewritten.

 - Josiah



Ty

Jul 1, 2015, 4:07:22 PM
to redi...@googlegroups.com
That makes sense. I'll have to check the network throughput next time I play with things to see if network contention is the real culprit. I suspect you're probably right.

Is there any way to have the OS prioritize the parent process over the child fork? I don't really care if the BGSAVE takes an extra minute if it means that the regular commands get run quickly.

-Ty

Josiah Carlson

Jul 1, 2015, 4:20:40 PM
to redi...@googlegroups.com
You can try to renice the child process, but I'm not sure that matters when push comes to shove, since it's the kernel that is writing blocks to the "block device".

 - Josiah

Greg Andrews

Jul 1, 2015, 4:55:40 PM
to redi...@googlegroups.com

Ty <ty.wan...@gmail.com> wrote:
Is there any way to have the OS prioritize the parent process over the child fork? I don't really care if the BGSAVE takes an extra minute if it means that the regular commands get run quickly.

There's ionice, though I don't think there's an entry point for Redis to ask the kernel to change the child process's i/o scheduling priority without hacking the Redis source code.  I'm not aware of a kernel i/o configuration that would automatically handle a child process differently than the parent.
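
For what it's worth, doing it by hand from a shell would look roughly like this (the pgrep incantation is just one way to find the forked child, and ionice priorities are only honored by the CFQ I/O scheduler):

# the BGSAVE child usually shows up as a second redis-server process whose parent is the main one
parent=$(pgrep -o -x redis-server)
child=$(pgrep -P "$parent" -x redis-server)
renice 19 -p "$child"     # lowest CPU priority
ionice -c 3 -p "$child"   # idle I/O class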

It kinda looks like you have two options: Attach a filesystem that's backed by local disk or SSD (as Josiah described), or spin up another server to be a slave Redis instance, and move your snapshots and S3 uploads to the slave.

One thing to remember about the slave option is that a full sync that puts the data into the slave is effectively a BGSAVE operation.  However, that should be a rare occurrence, not an hourly event.
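
In config terms, that split might look roughly like this (hostnames are placeholders):

# on the new slave: replicate from the master and let the slave do the snapshotting
redis-cli -h slave-host slaveof master-host 6379
# on the master: turn off automatic RDB saves so BGSAVE no longer runs there
redis-cli -h master-host config set save ""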

  -Greg
