Hi Sergey,
The cluster info script will be quite lightweight, so you could instead run it every minute on the instances where you are planning to run sentinel.
It would then act as a notification for you whenever the cluster state is not ok. Isn't that what you are planning to achieve with a set of sentinels that just issue alerts?
Also, I have only worked with sentinels in 2.8.9 and was not aware of the notification script you mentioned. I think it would be easier to have a monitoring script that watches the servers as per our requirements, instead of altering the behaviour of sentinels.
As per my limited knowledge of sentinels, they score over clusters only if you want to use redis for pub/sub, because pub/sub in a cluster is not recommended.
What we have is a few large cluster groups, a master-slave group, and a set of notification scripts:
1) A script running on each physical machine where multiple cluster nodes are running. It checks whether all the redis-server processes are running; if any is not, it sends a mail and starts it.
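The core of script 1 can be sketched as a pure function: given the ports we expect redis-server instances to listen on and the command lines of the running processes (e.g. from `ps -eo args`), report which instances are missing so the caller can mail an alert and restart them. All names and ports here are hypothetical, not taken from our actual script.

```python
import subprocess

def missing_instances(expected_ports, process_cmdlines):
    """Return the expected ports that have no matching redis-server process."""
    running = set()
    for cmd in process_cmdlines:
        if "redis-server" not in cmd:
            continue
        for port in expected_ports:
            # redis-server shows up in ps as e.g. "redis-server *:7000 [cluster]"
            # or with an explicit "--port 7000" argument.
            if f":{port}" in cmd or f"--port {port}" in cmd:
                running.add(port)
    return [p for p in expected_ports if p not in running]

def current_cmdlines():
    """Command lines of all running processes, one per entry."""
    out = subprocess.check_output(["ps", "-eo", "args"], text=True)
    return out.splitlines()
```

A cron entry would then call `missing_instances(EXPECTED_PORTS, current_cmdlines())` every minute and, for each missing port, send the mail and start the instance.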
2) A script which runs every 10 minutes on three machines. It checks cluster info, verifies the status is ok, and inserts a predefined key; it does the same for the master-slave instance. If this gives an error, it retries after 2 minutes, and if it fails again, it sends an email. The script runs on three separate physical machines: given that the replication of the system is 1, and that we are willing to accept the cluster failing when two or more machines are down, this is ok.
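The check-and-retry logic of script 2 could look roughly like this. `cluster_state_ok` parses the text reply of the `CLUSTER INFO` command (which reports `cluster_state:ok` when all slots are served); the mail and key-insert steps are left to the caller. This is a sketch, not our exact script.

```python
import time

def cluster_state_ok(cluster_info_reply):
    """Parse a CLUSTER INFO text reply and check cluster_state is ok."""
    for line in cluster_info_reply.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "cluster_state":
            return value == "ok"
    return False  # no cluster_state field at all: treat as failure

def check_with_retry(check, retries=1, delay_seconds=120):
    """Run the check; on failure wait and retry before giving up."""
    for attempt in range(retries + 1):
        if check():
            return True
        if attempt < retries:
            time.sleep(delay_seconds)
    return False  # caller sends the alert email here
```

In the real script, `check` would issue `redis-cli cluster info` (plus the predefined-key write) and the 2-minute retry corresponds to `delay_seconds=120`.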
3) Since a single physical machine hosts many masters as well as other masters' slaves, there is a script which runs every hour and checks whether each physical machine has an equal number of masters. If not, it sends an email. This is needed to avoid the case I described earlier, where two machines failing one after the other brings the cluster down; that may not apply in your case, depending on how your VMs are created from physical machines.
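The balance check in script 3 boils down to counting masters per physical machine. A minimal sketch, assuming you have already mapped each cluster node to its host (e.g. by parsing `CLUSTER NODES` output and grouping node IPs by machine; the function names here are made up):

```python
from collections import Counter

def master_counts(machines, nodes):
    """nodes: iterable of (machine, role) pairs, role 'master' or 'slave'.
    Returns a count of masters per machine, including machines with zero."""
    counts = Counter(m for m, role in nodes if role == "master")
    return {m: counts.get(m, 0) for m in machines}

def is_balanced(machines, nodes):
    """True iff every physical machine hosts the same number of masters."""
    return len(set(master_counts(machines, nodes).values())) <= 1
```

The hourly cron job would mail the `master_counts` breakdown whenever `is_balanced` returns False, so we know which machine to rebalance away from.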
In case of hardware failure, mostly a machine restarting, the redis process starts by itself (script 1). Script 2 almost never fails, because the cluster recovers from a hardware failure within a minute. When script 3 sends mails, we have to rebalance the cluster manually.
Apart from that, we have daily mailers that analyse the clusters and point out nodes where memory usage is above 70% or connections exceed some limit, and a basic dashboard built with graphite and grafana to analyse the cluster stats over time.
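The 70% memory check in the daily mailer can be done by parsing the `used_memory` and `maxmemory` fields from the `INFO memory` reply. A sketch, assuming `maxmemory` is configured on the node (redis reports `maxmemory:0` when no limit is set, in which case there is no percentage to compute):

```python
def memory_usage_ratio(info_reply):
    """Parse used_memory/maxmemory from an INFO memory text reply.
    Returns a ratio in [0, 1], or None if maxmemory is 0 or missing."""
    fields = {}
    for line in info_reply.splitlines():
        key, _, value = line.strip().partition(":")
        fields[key] = value
    used = int(fields.get("used_memory", 0))
    limit = int(fields.get("maxmemory", 0))
    return used / limit if limit else None

def over_threshold(info_reply, threshold=0.70):
    """True iff the node is using more than `threshold` of its maxmemory."""
    ratio = memory_usage_ratio(info_reply)
    return ratio is not None and ratio > threshold
</n```

The mailer iterates over all nodes, collects those where `over_threshold` is True (and similarly those over the connection limit, via `connected_clients` from `INFO clients`), and sends one summary mail per day.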
Considering a machine restart is not that frequent, once every few months or so, it has all been very stable and manageable.
Hope it helps.