appendonly + bgsave is locking up the redis-server process... thoughts?


Ty

Jul 1, 2015, 10:26:32 AM
to redi...@googlegroups.com
We have Redis 3.0.2 running on an r3.xlarge EC2 instance, currently using 16.4 GB of memory. We use both AOF and RDB persistence. The drive setup is a ZFS pool consisting of an 80 GB magnetic EBS volume for storage, with the instance's ephemeral SSD used as ZIL and L2ARC to improve throughput, cache reads, and provide a write log so synchronous writes complete faster.
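
(For context, a pool like this can be created along roughly these lines; the device names match the zpool output below, and carving the ephemeral SSD into a log partition and a cache partition is shown only as an illustration:)

# pool backed by the magnetic EBS volume
zpool create storage sdb
# ephemeral SSD partitions as ZIL (log) and L2ARC (cache)
zpool add storage log sdc1
zpool add storage cache sdc2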

In this setup, when a BGSAVE is initiated, things are OK for a bit, but then it appears the AOF can't be appended to quickly enough and Redis blocks, only serving requests in spurts. If I run 'config set appendonly no', Redis runs fine for the duration of the BGSAVE.
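
Concretely, the workaround amounts to something like this from redis-cli (worth noting that re-enabling AOF kicks off a background rewrite of the whole AOF, which is itself I/O-heavy):

redis-cli config set appendonly no     # stop appending to the AOF
redis-cli bgsave                       # fork and write the RDB snapshot
redis-cli config set appendonly yes    # re-enable AOF (triggers a background AOF rewrite)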

I thought the ZFS ZIL would solve this problem by committing the synchronous writes while the EBS volume was busy with the BGSAVE, but apparently no dice. Here's our zpool (sdb is the EBS volume, sdc is the ephemeral SSD):

[root@ip-10-171-132-86 redis]# zpool list storage
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  79.5G  12.9G  66.6G         -    38%    16%  1.00x  ONLINE  -

[root@ip-10-171-132-86 redis]# zpool status storage
  pool: storage
 state: ONLINE
  scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
storage     ONLINE       0     0     0
  sdb       ONLINE       0     0     0
logs
  sdc1      ONLINE       0     0     0
cache
  sdc2      ONLINE       0     0     0

errors: No known data errors


And zpool iostat during a BGSAVE with AOF enabled:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     14.9G  64.6G      0    115      0  14.4M
  sdb       14.9G  64.6G      0    114      0  14.4M
logs            -      -      -      -      -      -
  sdc1      4.50M  9.93G      0      0      0  69.3K
cache           -      -      -      -      -      -
  sdc2      63.2G  1.78G      0    280      0  34.8M
----------  -----  -----  -----  -----  -----  -----


What else can we do here to support both AOF and RDB persistence? I'd like to keep both: RDB because we send it hourly to S3 as our off-server backup, and AOF because, as long as the EBS volume survives whatever calamity takes the server out, we can come back up with no data loss.
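
(The hourly S3 upload is nothing fancy; roughly the following, run from cron, with the bucket and paths as placeholders:)

# hourly cron job; bucket name and RDB path are placeholders
aws s3 cp /var/lib/redis/dump.rdb s3://my-redis-backups/$(hostname)/dump-$(date +%Y%m%d%H).rdb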


FWIW, we used to have the same setup without the ZFS filesystem, just using an SSD EBS volume. It worked fine for a while, but eventually our I/O exceeded the IOPS allotted to the volume, and things got very bad when it throttled down to the 'baseline' rate. We'd prefer not to pay for a provisioned IOPS volume if possible, since that gets very expensive very quickly. Even then, it would still give us higher latency while a BGSAVE was happening, presumably because of disk contention with the AOF.

Thanks,
-Ty

Ty

Jul 1, 2015, 10:36:18 AM
to redi...@googlegroups.com
some further info...

I tested with `appendfsync no`, which helped, but there was still a period of about 10 seconds during the BGSAVE (it appeared to be around the middle of the snapshot's creation; 2.2 GB of the eventual 4 GB had been written to disk according to ls -lah) during which requests went unanswered.
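
For reference, that test was just the following; Redis also has a related no-appendfsync-on-rewrite option (not tried here) that skips AOF fsyncs while a background save or AOF rewrite is in progress:

redis-cli config set appendfsync no                 # let the OS decide when to flush the AOF
redis-cli config set no-appendfsync-on-rewrite yes  # not tried here: skip fsyncs during BGSAVE/BGREWRITEAOF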

Josiah Carlson

Jul 1, 2015, 3:08:59 PM
to redi...@googlegroups.com
This is what is going on:
1. When a BGSAVE starts, a child process is forked from the master
2. This child process then serializes the data into chunks and writes it to disk (EBS in your case)
3. Upon finishing serializing all of the data, Redis fsyncs the newly written snapshot
4. After fsyncing the snapshot, Redis renames the snapshot to whatever the name is in the config
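
(You can confirm when the child is active from the shell; the field names come from INFO persistence, and the grep pattern is just an example:)

redis-cli info persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_time_sec|aof_rewrite_in_progress'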

The problem:
* EBS is not a disk; it is a service that provides block-level storage (Elastic Block Store) on the other side of a network connection
* While #2 is going on, regular commands compete with the snapshot writes
* When #3 happens, the OS prioritizes pushing data to EBS at the expense of regular network traffic, preventing regular commands from getting through

The solution:
* Don't use EBS to create or store your snapshots - write them to a local SSD or spinning disk
* If you need snapshots persisted somewhere else, add slaves and/or add something that automatically backs up updated files (inotify + s3 backup works pretty well)
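
A minimal sketch of that inotify + S3 idea (assumes inotify-tools and the AWS CLI are installed; the watched directory and bucket are placeholders):

# copy dump.rdb to S3 whenever Redis renames a finished snapshot into place
inotifywait -m -e moved_to,close_write --format '%f' /var/lib/redis |
while read -r file; do
    [ "$file" = "dump.rdb" ] && aws s3 cp /var/lib/redis/dump.rdb s3://my-redis-backups/dump.rdb
done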

Note: AOF will have the same problem whenever it grows too big and gets rewritten.

 - Josiah



Ty

Jul 1, 2015, 4:07:22 PM
to redi...@googlegroups.com
That makes sense. I'll have to check the network throughput next time I play with things to see if network contention is the real culprit. I suspect you're probably right.

Is there any way to have the OS prioritize the parent process over the child fork? I don't really care if the BGSAVE takes an extra minute if it means that the regular commands get run quickly.

-Ty

Josiah Carlson

Jul 1, 2015, 4:20:40 PM
to redi...@googlegroups.com
You can try to renice the child process, but I'm not sure that matters when push comes to shove, since it's the kernel that is writing blocks to the "block device".

 - Josiah

Greg Andrews

Jul 1, 2015, 4:55:40 PM
to redi...@googlegroups.com

Ty <ty.wan...@gmail.com> wrote:
Is there any way to have the OS prioritize the parent process over the child fork? I don't really care if the BGSAVE takes an extra minute if it means that the regular commands get run quickly.

There's ionice, though I don't think there's an entry point for Redis to ask the kernel to change the child process's i/o scheduling priority without hacking the Redis source code.  I'm not aware of a kernel i/o configuration that would automatically handle a child process differently than the parent.
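
For what it's worth, doing it by hand from a shell would look roughly like this (the pgrep incantation is just one way to find the forked child, and ionice priorities are only honored by the CFQ I/O scheduler):

# the BGSAVE child usually shows up as a second redis-server process whose parent is the main one
parent=$(pgrep -o -x redis-server)
child=$(pgrep -P "$parent" -x redis-server)
renice 19 -p "$child"     # lowest CPU priority
ionice -c 3 -p "$child"   # idle I/O class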

It kinda looks like you have two options: Attach a filesystem that's backed by local disk or SSD (as Josiah described), or spin up another server to be a slave Redis instance, and move your snapshots and S3 uploads to the slave.

One thing to remember about the slave option is that a full sync that puts the data into the slave is effectively a BGSAVE operation.  However, that should be a rare occurrence, not an hourly event.
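
In config terms, that split might look roughly like this (hostnames are placeholders):

# on the new slave: replicate from the master and let the slave do the snapshotting
redis-cli -h slave-host slaveof master-host 6379
# on the master: turn off automatic RDB saves so BGSAVE no longer runs there
redis-cli -h master-host config set save ""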

  -Greg
