AWS autoscaling and Redis replication


Tony Lambropoulos

Apr 17, 2014, 10:46:54 PM
to redi...@googlegroups.com
I have an autoscaling group that only reads from our redis database.  To keep the latency as low as possible, I'd like to have a slave on each of these instances.  My question is, when does it become inefficient to have a slave per server in an autoscaling group and would that play nicely with instances constantly going up and being terminated?  My AS (autoscaling) group currently has about 50 instances.  Also, how taxing is replication on each of the servers in the AS group?

Josiah Carlson

Apr 18, 2014, 11:11:51 AM
to redi...@googlegroups.com
Every time your servers scale up, Redis will add a new slave. Unless there is already a slave that has recently connected, each new slave induces a BGSAVE + a full transfer of the dumped database to the slave. From there, all write commands are replicated out to all the slaves.

Potential issues:
* If your autoscale-up happens quickly, your master may not be able to keep up with sending dump files to the new slaves, all writes to the master can slow down, and the slaves can fall behind on replication.
* 50 slaves is a lot of slaves, especially if you have even modest write volume. Understand that for every byte that is written to the master, 50 bytes need to be transferred to your slaves.
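To put that fan-out in concrete numbers, here is a quick back-of-envelope sketch; the ~100-byte average write size is an assumption for illustration, and only the 50-slave count comes from the thread:

```python
def replication_egress(write_bytes_per_sec, num_slaves):
    """Every byte written to the master is re-sent to each slave."""
    return write_bytes_per_sec * num_slaves

# Assuming ~2,000 writes/sec at ~100 bytes each, fanned out to 50 slaves:
egress = replication_egress(2_000 * 100, 50)
print(egress)  # 10000000 bytes/sec, i.e. ~10 MB/s of replication traffic
```

The write volume itself may be modest, but the master's outbound bandwidth scales linearly with the slave count.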

Potential solutions:
* You may be able to address some of the data transfer issues by using an SSH tunnel with compression enabled: http://tech.3scale.net/2012/07/25/fun-with-redis-replication/
* Another option is that you can set up an intermediate level that initially slaves from the master and your autoscale group slaves from them... but that can be a lot of work to make it all correct.
* Unless you have extraordinary read volume, it is very unlikely that you will actually need a slave on every autoscaled server. Instead, you could consider a smaller set of read slaves, say 5-10 or so, and you can hit them directly from each autoscaled server.
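A minimal sketch of that last option (the replica addresses below are made up; in practice they would come from configuration, service discovery, or DNS): each autoscaled app server simply picks one replica per read instead of running a local slave.

```python
import random

# Hypothetical replica addresses for illustration only.
READ_REPLICAS = ["10.0.1.10:6379", "10.0.1.11:6379", "10.0.1.12:6379"]

def pick_read_replica(replicas=READ_REPLICAS):
    """Spread reads across the small replica pool."""
    return random.choice(replicas)

host, port = pick_read_replica().split(":")
```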

Do you have evidence that you need the read performance of running a slave on every server? If not, I'd try something else.

 - Josiah



On Thu, Apr 17, 2014 at 7:46 PM, Tony Lambropoulos <tony...@gmail.com> wrote:
I have an autoscaling group that only reads from our redis database.  To keep the latency as low as possible, I'd like to have a slave on each of these instances.  My question is, when does it become inefficient to have a slave per server in an autoscaling group and would that play nicely with instances constantly going up and being terminated?  My AS (autoscaling) group currently has about 50 instances.  Also, how taxing is replication on each of the servers in the AS group?

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/d/optout.

Greg Andrews

Apr 18, 2014, 12:24:20 PM
to redi...@googlegroups.com
I'd like to expand on part of Josiah's answer:

Josiah wrote:
Every time your servers scale up, Redis will add a new slave. Unless there is already a slave that has recently connected, each new slave induces a BGSAVE + a full transfer of the dumped database to the slave. From there, all write commands are replicated out to all the slaves.


From the redis.io documentation page on Replication, under the heading "How Redis replication works":

If you set up a slave, upon connection it sends a SYNC command. It doesn't matter if it's the first time it has connected or if it's a reconnection.

The master then starts background saving, and starts to buffer all new commands received that will modify the dataset. When the background saving is complete, the master transfers the database file to the slave, which saves it on disk, and then loads it into memory. The master will then send to the slave all buffered commands.


So in detail, after the new slave has connected and issued the SYNC command, the sequence is:
  1. The master writes the full dataset from RAM to disk (the BGSAVE that Josiah mentioned)
  2. The master transfers the full dataset from disk to the new slave, which saves it to disk
  3. The new slave loads the full dataset from disk to RAM
  4. The new slave reads, as fast as possible, the updates that the master buffered in RAM during steps 2 and 3
  5. When the new slave has caught up with the buffered updates, the replication stream slows down to the normal rate of updates
If multiple slaves connect and issue SYNC commands at the same time (i.e., during Step 1), the master will re-use the same BGSAVE and on-disk dump file rather than performing a separate save for each slave.
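As a sketch, a provisioning script could decide that a freshly started slave has finished Steps 1-4 by parsing the `master_link_status` and `master_sync_in_progress` fields of `INFO replication` (those are real fields; the sample output below is assumed for illustration):

```python
# Sample `INFO replication` output from a slave (assumed for illustration);
# in practice you would fetch it with redis-cli or a client library.
SAMPLE_INFO = """\
role:slave
master_host:10.0.0.5
master_port:6379
master_link_status:up
master_sync_in_progress:0
"""

def initial_sync_complete(info_text):
    """True once the slave's link is up and the bulk SYNC has finished."""
    fields = dict(line.split(":", 1) for line in info_text.strip().splitlines())
    return (fields.get("master_link_status") == "up"
            and fields.get("master_sync_in_progress") == "0")

print(initial_sync_complete(SAMPLE_INFO))  # True
```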

Notice how Steps 1-3 make heavy use of disk.  In Step 1, Redis writes to disk on the master as fast as it can[*].  In Step 2, Redis reads from disk on the master as fast as it can, and writes to disk on the slave as fast as it can.  In Step 3, Redis reads from disk on the slave as fast as it can.

If there are multiple slaves syncing from the same master in Step 2, the master's disk I/O can saturate, slowing down the transfers to the slaves.  Meanwhile the master's RAM consumption grows because more updates are added to the buffer (this is the buffer of updates mentioned in Step 4).

If the master and slaves have sufficient disk I/O to keep these writes and reads fast, then you won't have problems.  However, disk I/O is often a weak part of cloud systems, so test your idea thoroughly.


[*] For an idea of how easy it is to saturate disk I/O when saving from RAM or loading to RAM, see this post which illustrates the difference in speeds between RAM, SSD, and HD.

  -Greg

Tony Lambropoulos

Apr 18, 2014, 2:30:33 PM
to redi...@googlegroups.com
Hey Guys,
First, I appreciate the in-depth explanations of the hurdles I'll face. The issues you've mentioned seem like they could be problematic, especially if/when the AS group gets larger (it might not be the most scalable solution?).
To give a little more context:  
I'm seeing about 5,000 reads per second and 2,000 writes per second, and have 2 replicas (this is for one of our Redis DBs; we have several). This approach is working relatively OK (connections don't time out, data is retrieved successfully, etc.), but the real issue is latency. Our application has very hard time constraints and a lot of request volume. Reading from a replica, even on the same internal AWS network, seems to have inconsistent enough latency that it begins to push our application's response time over our time constraints.
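One simple way to quantify that inconsistency is to record per-read latency and compare the median to the worst case; the no-op callable below is a stand-in for a real HGET (with redis-py, something like `lambda: r.hget(key, field)`):

```python
import time
import statistics

def measure_read_latency(read_fn, n=1000):
    """Time n calls of read_fn and return (median_ms, max_ms)."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        read_fn()  # substitute a real Redis read here
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples), max(samples)

median_ms, worst_ms = measure_read_latency(lambda: None)
```

A large gap between the median and the maximum is exactly the kind of tail-latency variance described above.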

Perhaps I'll first try Josiah's practical approach of adding more slaves and seeing if that alleviates the inconsistent response times (probably a good idea, since network latency is likely fairly consistent? I need to test).

Anyway, if there are any other suggestions, I'm all ears.

Thanks,
Tony

Josiah Carlson

Apr 18, 2014, 4:07:56 PM
to redi...@googlegroups.com
How short are your deadlines? How much total data do you have? How much data churn do you have? How sensitive is your app to stale data?

 - Josiah



Tony Lambropoulos

Apr 18, 2014, 4:26:13 PM
to redi...@googlegroups.com
How short are your deadlines?
Within about a week we'd like at least a workable improved solution, but the project can be ongoing
How much total data do you have?
Our biggest database has an upper limit of 10 million or so (may increase as we scale)

How much data churn do you have?
I assume this means how often data is deleted. All data is useless 24 hours after we first insert a key; then it's erased.

How sensitive is your app to stale data?
We basically increment a count for each of these keys, but our reads don't need to be perfect. In short, we can tolerate somewhat stale data (an actual count of 3 and reading 2 is OK, but an actual count of 10 and reading 2 is not OK).

Also, I've read a few of your articles (one on Bloom filters, one on why you didn't use Bloom filters) and have tried to think of ways to apply them, but haven't thought of anything that fits really well. If you want more details, etc., I'd be more than willing to chat over Google+ or what have you.

Thanks,
Tony 

Josiah Carlson

Apr 18, 2014, 5:15:25 PM
to redi...@googlegroups.com
On Fri, Apr 18, 2014 at 1:26 PM, Tony Lambropoulos <tony...@gmail.com> wrote:
How short are your deadlines?
Within about a week we'd like at least a workable improved solution, but the project can be ongoing

I actually meant the commands that you are sending to Redis. How quickly do your commands have to return? And what commands do you call to read data?
 
How much total data do you have?
Our biggest database has an upper limit of 10 million or so (may increase as we scale)

Another misunderstanding. How much RAM does it use?

How much data churn do you have?
I assume this means how often data is deleted. All data is useless 24 hours after we first insert a key; then it's erased.

You gave me what I need. You are doing daily counters. There might be ways of reducing your memory use depending on how you structure your keys. That could reduce the time to fork() for snapshots, and could give you smaller snapshots for slaves.
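One common key-structuring trick for 24-hour counters, offered here as an assumption about what "structure your keys" could mean rather than something prescribed in the thread: bucket all of a day's counters into one hash, so a single EXPIRE retires the whole day at once instead of a TTL per counter.

```python
import datetime

def day_bucket_key(prefix, when):
    """Hash key shared by all counters created on the same UTC day."""
    return f"{prefix}:{when:%Y-%m-%d}"

# With redis-py (calls sketched as comments, not executed here):
#   r.hincrby(day_bucket_key("counts", now), item_id, 1)
#   r.expire(day_bucket_key("counts", now), 2 * 86400)  # keep at most ~2 days
print(day_bucket_key("counts", datetime.datetime(2014, 4, 18)))  # counts:2014-04-18
```

Packing many small counters into a few hashes also tends to shrink the in-memory footprint, which shrinks snapshots in turn.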

How sensitive is your app to stale data?
We basically increment a count for each of these keys, but our reads don't need to be perfect. In short, we can tolerate somewhat stale data (an actual count of 3 and reading 2 is OK, but an actual count of 10 and reading 2 is not OK).

So maybe a few minutes' delay is cool, but a couple of hours is right out. I had considered that it might be viable to do a periodic snapshot, upload it to S3, have clients pull down the snapshot, pause the web servers, restart Redis with the new dump, and then unpause the web servers. This does not seem viable given the variance requirements.

Also, I've read a few of your articles (one on Bloom filters, one on why you didn't use Bloom filters) and have tried to think of ways to apply them, but haven't thought of anything that fits really well. If you want more details, etc., I'd be more than willing to chat over Google+ or what have you.

I usually like to keep it public unless someone is looking to pay me ;)

Possible "easy" solution: use bigger instances for your autoscale group to try to get your instance count to 5-10 at the most. Then you can put a slave on every server and get almost exactly what you want. Your write volume suggests that this should still be pretty viable as long as random disconnects are rare. The biggest obvious drawback is that the incremental cost for running another server is significantly larger.

Another option is to set up some high-IO slaves of your master (say 5-10) that are configured in DNS to be returned in a round-robin fashion, and have your autoscale slaves connect to one of the servers based on that DNS entry. That should get you a tree-like structure for handling IO write scaling, slaving with snapshotting, etc., but without a lot of the management.
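A sketch of the client side of that DNS round-robin idea (the production hostname is made up; `localhost` is resolved below only so the snippet works anywhere):

```python
import random
import socket

def resolve_replica(hostname, port=6379):
    """Resolve the replica-pool DNS name and pick one address to connect to."""
    addrs = {info[4][0] for info in
             socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)}
    return random.choice(sorted(addrs))

# In production this would be something like "redis-replicas.internal";
# here we resolve localhost just to show the mechanics.
print(resolve_replica("localhost"))
```

Each autoscaled instance then opens its Redis connection to whichever intermediate slave it drew, and the master only ever syncs the small intermediate tier.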

 - Josiah

Tony Lambropoulos

Apr 18, 2014, 6:00:20 PM
to redi...@googlegroups.com
How short are your deadlines?
Yeah, that makes a lot more sense. We'd like to keep it under or around 2 milliseconds. We're running HGETs for reads (which could probably be turned into an HGETALL, actually) and HSETs for writes. Those are the only two commands we run.

How much RAM does it use?
The biggest DB peaks at ~2 GB (usually much lower, though). The 3-4 others are only around ~200 MB.

Your last idea sounds really promising, as it will relieve the master during replication even as the AS group scales up and the databases grow (the tree structure you mentioned). What do you mean by "IO write scaling", though?

Thanks,
Tony

Josiah Carlson

Apr 18, 2014, 11:55:23 PM
to redi...@googlegroups.com
On Fri, Apr 18, 2014 at 3:00 PM, Tony Lambropoulos <tony...@gmail.com> wrote:
How short are your deadlines?
Yeah, that makes a lot more sense. We'd like to keep it under or around 2 milliseconds. We're running HGETs for reads (which could probably be turned into an HGETALL, actually) and HSETs for writes. Those are the only two commands we run.

Okay, you are probably already doing the right thing memory-wise, so you probably can't save a huge amount there.

How much RAM does it use?
The biggest DB peaks at ~2 GB (usually much lower, though). The 3-4 others are only around ~200 MB.

That's not horribly huge. I've found that for most types of data, I will usually see the snapshot taking between 1/20th and 1/2 the size of the in-memory data size, depending on the types of data used. I'm a bit less concerned about new slaves given this data size, but I would still add intermediate nodes for slaving from the master.
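Applying that rule of thumb to the 2 GB peak mentioned above:

```python
in_memory_mb = 2 * 1024      # peak in-memory size from the thread, in MB
low_mb = in_memory_mb / 20   # optimistic end of the 1/20..1/2 range
high_mb = in_memory_mb / 2   # pessimistic end
print(f"expected snapshot: {low_mb:.0f}-{high_mb:.0f} MB")  # expected snapshot: 102-1024 MB
```

Even at the pessimistic end, a ~1 GB dump transferred to a handful of intermediate slaves is far more manageable than to 50.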

Your last idea sounds really promising, as it will relieve the master during replication even as the AS group scales up and the databases grow (the tree structure you mentioned). What do you mean by "IO write scaling", though?

Short version: every write to the master is distributed to every slave. Have 50 slaves? You've got 50 copies of every write to the master going out. That's substantial write amplification, and without something to absorb that write I/O scaling issue (because there is one), you're going to have a bad time.