Which metrics to monitor (seeking expert advice)?


Mark

Apr 9, 2018, 5:35:13 PM
to Redis DB
Hi,
I'm building a monitoring system (which will include the ability to monitor Redis) and want to set up default alerts (threshold or anomaly) on the 2-3 key metrics that everyone who uses Redis would typically want to alert on. I don't yet have production-grade experience with Redis, so I decided to ask for advice here.

These "default alerts" will be automatically created for each user that chooses to monitor Redis, so rules have to be generally useful (it wouldn't make sense to alert on metrics whose values vary wildly based on the size of deployment).

I may have overcomplicated that description, so let me ask the question in a single sentence:
Which metrics would be most significant indicators that something went wrong with your Redis deployment?

Thanks very much,
Mark

Thomas Love

Apr 10, 2018, 4:18:52 AM
to Redis DB
The one that's saved me a couple of times is used_memory_rss, which flagged application memory leaks that threatened the system. But obviously this varies by deployment. A config-free monitoring system would need to learn the baseline and variance over some learning period (and then still be configurable, because false positives kill the value of monitoring).
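To make the "learn the baseline and variance" idea concrete, here is a minimal sketch. It assumes you have already collected used_memory_rss samples (in bytes) during a learning period; the function names and the default of 3 standard deviations are illustrative assumptions, not anything Redis provides, and the multiplier should stay user-configurable.

```python
from statistics import mean, stdev


def learn_baseline(samples):
    """Return (mean, standard deviation) of the samples from the learning period."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate variance")
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline_mean, baseline_std, k=3.0):
    """True when value is more than k standard deviations above the baseline.

    k is a hypothetical tuning knob; a larger k means fewer false positives.
    """
    return value > baseline_mean + k * baseline_std
```

A real system would probably also re-learn the baseline periodically, since normal memory use drifts as the dataset grows.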

Other warning metrics might be found in the rdb/aof (persistence) and replication stats, but their interpretation is also application-dependent, and a failure there would usually be a symptom of disk or network issues that are better covered by lower-level, higher-priority tests. Red herrings also undermine monitoring.

Thomas

Joshua Scott

Apr 10, 2018, 10:28:21 AM
to Redis DB
* Latency metrics might be a generally useful one.
* RSS that is > 1.2x your maxmemory setting.
* Number of clients approaching the max client limit.
* High CPU utilization - you can extract CPU time from the CPU section of the INFO command (INFO cpu) and determine whether Redis is using more than, say, 80% of the CPU, which might lead to high latency spikes.
* Replication - make sure slaves are not falling behind and that they stay connected.
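A rough sketch of how several of the checks above could be evaluated from INFO output. It assumes the INFO fields have already been parsed into dictionaries with numeric values (for example, as a client library might return them) and that two samples were taken some seconds apart, because used_cpu_sys and used_cpu_user are cumulative. The function names, the maxclients parameter (read separately, e.g. via CONFIG GET maxclients), and the 90% client threshold are assumptions for illustration. Latency and replication lag are left out because they need extra measurements.

```python
def cpu_utilization(prev_info, info, elapsed_seconds):
    """Fraction of one CPU core used by Redis between two INFO samples."""
    prev_used = prev_info["used_cpu_sys"] + prev_info["used_cpu_user"]
    curr_used = info["used_cpu_sys"] + info["used_cpu_user"]
    return (curr_used - prev_used) / elapsed_seconds


def check_redis(prev_info, info, elapsed_seconds, maxclients,
                rss_ratio=1.2, client_ratio=0.9, cpu_limit=0.8):
    """Return a list of alert names; thresholds are illustrative defaults."""
    alerts = []

    # maxmemory == 0 means "no limit", so the RSS check does not apply.
    maxmemory = info.get("maxmemory", 0)
    if maxmemory > 0 and info["used_memory_rss"] > rss_ratio * maxmemory:
        alerts.append("rss_over_maxmemory")

    if info["connected_clients"] >= client_ratio * maxclients:
        alerts.append("clients_near_limit")

    if cpu_utilization(prev_info, info, elapsed_seconds) > cpu_limit:
        alerts.append("high_cpu")

    # On a replica, master_link_status is "up" while it is connected.
    if info.get("role") == "slave" and info.get("master_link_status") != "up":
        alerts.append("replica_disconnected")

    return alerts
```

Each threshold is a keyword argument so that the defaults can be overridden per deployment.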