Monitoring slaves

Adi Chiru

Jun 2, 2014, 10:19:50 PM
to redis-db
Hello,

We have quite a few Redis instances and I just added slaves to most
of them. Doing this for some of the biggest and busiest ones, however,
was a little tricky, and now I need to monitor how well the slaves are
doing relative to their master.

I read the docs, searched this mailing list and the web, and read a
few blogs, but there does not seem to be a clear
answer to this.

How do you do it?

Also, if anyone can explain what these mean, I'd really appreciate it:

1. What is "repl_backlog_active" ?
2. What is the "lag" in this line:
slave0:ip=10.144.178.126,port=6379,state=online,offset=83052571679,lag=1
and what is its measurement unit?
3. What is the "offset" in the above line and how does this relate to
the master_repl_offset value?
4. Is min-slaves-max-lag related to the lag value in the "slaveN:"
line in replication?


Thanks for help!
Adi

Josiah Carlson

Jun 3, 2014, 2:00:25 AM
to redi...@googlegroups.com
Meant to reply to your other thread, but got busy. So here you go.

On Mon, Jun 2, 2014 at 7:19 PM, Adi Chiru <adic...@gmail.com> wrote:

1. What is "repl_backlog_active" ?

According to the source, it is actually the result of a comparison: repl_backlog != NULL, printed via printf("%d", server.repl_backlog != NULL), if I remember correctly. The boolean result is being cast to a 32-bit signed integer, which has the apparent value "14". That doesn't *seem* to be a messed-up compiler optimization (though it could be), if only because an actual offset in a structure would likely be 4- or 8-byte aligned, and this is only 2-byte aligned.
 
2. What is the "lag" in this line:
slave0:ip=10.144.178.126,port=6379,state=online,offset=83052571679,lag=1
and what is its measurement unit?

It is the time since the slave last sent an "ack" back, measured in seconds. So lag=1 represents 1 second.
 
3. What is the "offset" in the above line and how does this relate to
the master_repl_offset value?

Offset represents how many bytes from the beginning of ... something (maybe it counts the number of bytes of commands it has processed). The replication backlog keeps a buffer with offsets. This particular byte offset is interesting, but even more interesting with the *other* lines around it. From the other thread:

slave0:ip=10.144.178.126,port=6379,state=online,offset=83052571679,lag=1
master_repl_offset:83809730520
repl_backlog_active:18
repl_backlog_size:1048576000
repl_backlog_first_byte_offset:82761154521

Okay, the repl_backlog_size (1048576000 bytes) buffer starts at repl_backlog_first_byte_offset (82761154521), the master has executed everything up to master_repl_offset (83809730520), and the slave is at slave0.offset (83052571679). Your slave is 83809730520 - 83052571679 = 757,158,841 bytes behind.
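
The arithmetic above can be sketched in Python. The numbers are the ones from the INFO output quoted in this message. The function name and the "headroom" figure are made up for illustration, and treating the distance to the oldest backlog byte as what matters for a partial resync is an assumption, not something confirmed in this thread.

```python
def replication_gap(master_repl_offset, slave_offset, backlog_first_byte_offset):
    """Return (bytes_behind, backlog_headroom) for one slave.

    bytes_behind:     how far the slave trails the master, in bytes.
    backlog_headroom: distance between the slave's position and the oldest
                      byte still held in the master's backlog (assumed to
                      be what matters if the slave has to resync partially).
    """
    bytes_behind = master_repl_offset - slave_offset
    backlog_headroom = slave_offset - backlog_first_byte_offset
    return bytes_behind, backlog_headroom


behind, headroom = replication_gap(83809730520, 83052571679, 82761154521)
print(behind)    # 757158841
print(headroom)  # 291417158
```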

Either your slave isn't fast enough, your network isn't fast enough, or a combination of the two. If you could find a compressed socket tunneler that did LZ4 on the line, you could get about a 50-70% savings on bandwidth for about 5-10% of one CPU per Redis server. No joke. I've got a prototype I hacked together in Python using straight zlib (at compression level 5) that cuts bandwidth usage for many of my use-cases by a factor of 5-7x at about 15% CPU. The only reason I haven't released it is because the Python ssl library is a bit funky with partial sends.

I should just release it without SSL support.
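
For anyone curious what the zlib approach looks like, here is a minimal sketch. It is not the prototype mentioned above, only an illustration of streaming compression with the standard zlib module at level 5. The helper name, the made-up SET traffic, and the batch size of 50 are all assumptions, and it deliberately leaves out sockets and SSL.

```python
import zlib


def make_compressed_pipe(level=5):
    """Return (send, receive) functions sharing one zlib stream per direction."""
    compressor = zlib.compressobj(level)
    decompressor = zlib.decompressobj()

    def send(chunk):
        # Z_SYNC_FLUSH pushes out everything buffered so far, so the other
        # end can decode this chunk without waiting for later data.
        return compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)

    def receive(wire_bytes):
        return decompressor.decompress(wire_bytes)

    return send, receive


# Made-up traffic: 1000 similar SET commands in the Redis protocol,
# sent in batches of 50 as a stand-in for socket reads.
commands = [
    b"*3\r\n$3\r\nSET\r\n$8\r\nkey:%04d\r\n$5\r\nvalue\r\n" % i
    for i in range(1000)
]
batches = [b"".join(commands[i:i + 50]) for i in range(0, len(commands), 50)]

send, receive = make_compressed_pipe()
raw_total = wire_total = 0
for batch in batches:
    wire = send(batch)
    assert receive(wire) == batch  # lossless round trip
    raw_total += len(batch)
    wire_total += len(wire)

print("raw bytes: %d, bytes on the wire: %d" % (raw_total, wire_total))
```

The savings on real replication traffic depend entirely on how repetitive the data is, so treat the printed numbers as a toy result only.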
 
4. Is min-slaves-max-lag related to the lag value in the "slaveN:"
line in replication?

Sort of. Each slave attached to your server gets assigned a slave ID, which is the N in "slaveN". Since you only have one slave, he is called slave0.

This particular setting, as the config file describes, limits your potential loss of data: the master compares each slave's lag against it and stops accepting writes when too few slaves are within the limit. With the "up to 10 seconds" that is in the conf file, that's not too horrible.
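
As a rough illustration of how that setting uses the lag value, here is a sketch based on my reading of the redis.conf comments. The function name is made up, min-slaves-to-write is the companion directive from redis.conf (it is not mentioned elsewhere in this thread), and the real check happens inside the server, not in client code.

```python
def master_accepts_writes(slave_lags, min_slaves_to_write, min_slaves_max_lag):
    """Approximate the min-slaves rule from redis.conf.

    slave_lags is the list of "lag" values (in seconds) from the slaveN
    lines. Setting either option to 0 is assumed to disable the feature.
    """
    if min_slaves_to_write == 0 or min_slaves_max_lag == 0:
        return True
    good_slaves = sum(1 for lag in slave_lags if lag <= min_slaves_max_lag)
    return good_slaves >= min_slaves_to_write


print(master_accepts_writes([1], 1, 10))   # True: slave0 acked 1 second ago
print(master_accepts_writes([25], 1, 10))  # False: the only slave is too stale
```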

 - Josiah


Adi Chiru

Jun 3, 2014, 8:25:51 PM
to redis-db
Thanks Josiah!

Very useful information.

Regarding the on-line compression of data, SSH tunnels set up using v2
use compression by default (with zlib) at level 6 (the
compression level is not configurable), so my question is: why not use
SSH for these tunnels?

I will do some tests myself as soon as I can.

Thanks,
Adi

Adi Chiru

Jun 3, 2014, 8:58:54 PM
to redis-db
And one more question please:

On the master I get this (note the master_repl_offset, please):
root@redis2:/home/ubuntu# redis-cli info replication
# Replication
role:master
connected_slaves:1
slave0:ip=10.144.178.126,port=6379,state=online,offset=1465619630555,lag=0
master_repl_offset:1466376713539
repl_backlog_active:1
repl_backlog_size:1048576000
repl_backlog_first_byte_offset:1465328137540
repl_backlog_histlen:1048576000

On its slave I get this - again, note the master_repl_offset:

root@redis2:/home/ubuntu# redis-cli -h 10.144.178.126 -p 6379 info replication
# Replication
role:slave
master_host:redis2.freelancer.com
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:1465808387214
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

So, on master the master_repl_offset is 1466376713539 while on its
slave it is 0 - commands have been run less than 1 sec apart.

Is this normal? Representing maybe same info from 2 different points of view?

I am trying to get the number of bytes the slave is behind its master
in a python script to be run for monitoring purposes against the slave
but it seems I can't do this.... any other ideas on how to do this
reliably?

Thanks,
Adi

Josiah Carlson

Jun 3, 2014, 11:50:57 PM
to redi...@googlegroups.com
On Tue, Jun 3, 2014 at 5:25 PM, Adi Chiru <adic...@gmail.com> wrote:
Thanks Josiah!

Very useful information.

Regarding the on-line compression of data, SSH tunnels set up using v2
use compression by default (with zlib) at level 6 (the
compression level is not configurable), so my question is: why not use
SSH for these tunnels?

You tried that already during heavy load, and it didn't work, right? That would suggest that zlib at compression level 6 might not be fast enough to handle compressing the 6 gig snapshot + the continuing replication stream before the backlog gets full. It could also mean that your system is CPU-bound, and ssh can't get enough CPU to compress the data fast enough.

I will do some tests myself as soon as I can.

Yes, run some tests, and let us know what the problem is. Compression performance, CPU use, something else that I haven't considered... That will at least give us ideas to help pass along to the next guy :)

 - Josiah

Josiah Carlson

Jun 4, 2014, 12:00:19 AM
to redi...@googlegroups.com
On the slave, you have to understand that all "master"-related information is related to the slave being a master itself. Why? Because all Redis servers, including slaves, can also be masters. You can create slave replication chains and trees. Note that you should never create a loop - loops are a great way to heavily load your system, saturate your network, and lose data. Yes, all three. It's nasty. And somewhat entertaining if you are into that kind of thing.

I am trying to get the number of bytes the slave is behind its master
in a python script to be run for monitoring purposes against the slave
but it seems I can't do this.... any other ideas on how to do this
reliably?

The slave tells you itself: "slave_repl_offset:1465808387214" on the slave corresponds (in this case) to slave0.offset:1465619630555 on the master. If I remember correctly, the number on the slave represents the total bytes received and processed, and on the master I believe it represents how many bytes have been acked by the slave. If you are looking to monitor the slave, just pull slave_repl_offset from the slave, and you are set :)

 - Josiah

Adi Chiru

Jun 4, 2014, 1:14:18 AM
to redis-db
Inline reply:
Makes total sense now that you reminded me of the possibility to link slaves.


>
>> I am trying to get the number of bytes the slave is behind its master
>> in a python script to be run for monitoring purposes against the slave
>> but it seems I can't do this.... any other ideas on how to do this
>> reliably?
>
>
> The slave tells you itself: "slave_repl_offset:1465808387214" on the
> slave corresponds (in this case) to slave0.offset:1465619630555 on the master.
> If I remember correctly, the number on the slave represents the total bytes
> received and processed, and on the master I believe it represents how
> many bytes have been acked by the slave. If you are looking to monitor the
> slave, just pull slave_repl_offset from the slave, and you are set :)

So the number of bytes behind master is the value of "slave_repl_offset" ?
It doesn't make sense to me, or I am failing to understand something: that value is constantly increasing, even for slaves attached to very small masters that are not busy at all.

Josiah Carlson

Jun 4, 2014, 7:18:51 PM
to redi...@googlegroups.com
The number is *progress*. On the master, you have "master_repl_offset", which tells you how far along the master is. On the slave, you have "slave_repl_offset", which tells you how far along the slave is. The slave's progress is also on the master as the "offset" field of the appropriate "slaveX" entry. How far behind the slave is can be calculated as the master-side master_repl_offset minus slave_repl_offset on the slave, or (equivalently) master_repl_offset - slaveX.offset.

The tough part is that the slave has no idea how much total data the master has, so you can't just monitor the slave to know how far behind he is. You *must* query the master, but if you are querying the master, you may as well calculate how far behind *all* of your slaves are (which you can differentiate based on the IP address listed for each slave).
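
A minimal sketch of that master-side calculation in Python follows. It parses the text of "info replication" as pasted earlier in this thread and reports bytes behind for every slave, keyed by ip:port. The function names are made up, and in a real script the text would come from running redis-cli (or a client library) against the master instead of a hard-coded sample.

```python
def parse_info(text):
    """Turn 'key:value' lines of INFO output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def slaves_behind(master_info_text):
    """Map 'ip:port' of each slave to how many bytes it trails the master."""
    info = parse_info(master_info_text)
    master_offset = int(info["master_repl_offset"])
    result = {}
    for key, value in info.items():
        # Only the slave0, slave1, ... lines; skip keys such as slave_priority.
        if key.startswith("slave") and key[5:].isdigit():
            fields = dict(item.split("=", 1) for item in value.split(","))
            address = "%s:%s" % (fields["ip"], fields["port"])
            result[address] = master_offset - int(fields["offset"])
    return result


SAMPLE = """\
# Replication
role:master
connected_slaves:1
slave0:ip=10.144.178.126,port=6379,state=online,offset=1465619630555,lag=0
master_repl_offset:1466376713539
repl_backlog_active:1
"""

print(slaves_behind(SAMPLE))  # {'10.144.178.126:6379': 757082984}
```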

 - Josiah

Adi Chiru

Jun 5, 2014, 9:06:27 AM
to redis-db
Makes total sense now, and thank you for clarifying that, Josiah.

I have a Python script doing all the monitoring for Redis servers, master or slave (even in read-only mode), and replication.
I think it would be a good thing if we could add these details to the documentation, with the script as an example. Is this possible?

Also, now that I can monitor the replication, I will start testing the compression tunnel for master-slave communication and come back with details.

Thanks,
Adi

Josiah Carlson

Jun 5, 2014, 2:15:54 PM
to redi...@googlegroups.com
On Thu, Jun 5, 2014 at 6:06 AM, Adi Chiru <adic...@gmail.com> wrote:
Makes total sense now, and thank you for clarifying that, Josiah.

I have a Python script doing all the monitoring for Redis servers, master or slave (even in read-only mode), and replication.
I think it would be a good thing if we could add these details to the documentation, with the script as an example. Is this possible?

Send a pull request for the docs :)

On the monitoring side of things, many existing monitoring systems (munin, nagios, others...) already have plugins that know how to interpret how far behind Redis slaves are. Which system are you using?

Also, now that I can monitor the replication, I will start testing the compression tunnel for master-slave communication and come back with details.

Awesome :)

Adi Chiru

Jun 5, 2014, 7:23:55 PM
to redis-db
On Fri, Jun 6, 2014 at 4:15 AM, Josiah Carlson <josiah....@gmail.com> wrote:

On the monitoring side of things, many existing monitoring systems (munin, nagios, others...) already have plugins that know how to interpret how far behind Redis slaves are. Which system are you using?
I am currently using Nagios, but all I could find for this were Perl scripts (we need bash or Python here), and none of them actually checks how many bytes the slave is behind the master, which, for us, is the best info to look at.
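
A Python check along those lines could be sketched as follows. The exit codes are the standard Nagios plugin ones (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name and the threshold values are hypothetical, and the bytes-behind figure is assumed to have been computed from the master's INFO output already.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_slave_lag(bytes_behind, warn_bytes, crit_bytes):
    """Return (exit_code, message) in the style of a Nagios plugin.

    bytes_behind is None when the slave is not listed on the master.
    """
    if bytes_behind is None:
        return UNKNOWN, "UNKNOWN - slave not found in master INFO output"
    if bytes_behind >= crit_bytes:
        return CRITICAL, "CRITICAL - slave is %d bytes behind" % bytes_behind
    if bytes_behind >= warn_bytes:
        return WARNING, "WARNING - slave is %d bytes behind" % bytes_behind
    return OK, "OK - slave is %d bytes behind" % bytes_behind


# Hypothetical thresholds: warn at 100 MB, critical at 500 MB.
code, message = check_slave_lag(757158841, 100 * 1024 ** 2, 500 * 1024 ** 2)
print(code, message)  # 2 CRITICAL - slave is 757158841 bytes behind
```

A real plugin would print the message and then call sys.exit(code) so that Nagios picks up the status.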
