The mysqld_exporter project includes a useful example rule for alerting on heartbeat lag:
groups:
- name: example.rules
  rules:
  - record: mysql_heartbeat_lag_seconds
    expr: mysql_heartbeat_now_timestamp_seconds - mysql_heartbeat_stored_timestamp_seconds
  ...
  - alert: MySQLReplicationLag
    expr: (mysql_heartbeat_lag_seconds > 30) and ON(instance) (predict_linear(mysql_heartbeat_lag_seconds[5m], 60 * 2) > 0)
In my case, the master's server_id may change because of the way we operate our MySQL cluster, so we can end up with the following series:
{instance="batchdb001.mo-staging99-nonprod.dus1.cloud",job="prometheus-mysqld-exporter",server_id="2001500"} 0.5187849998474121
{instance="batchdb001.mo-staging99-nonprod.dus1.cloud",job="prometheus-mysqld-exporter",server_id="3212"} 1594051555.519615
As you can see, there are multiple series for a single instance, and only one of them is right: the one whose server_id matches the current master. In principle it's easy to determine the correct one, because there is also a metric, mysql_slave_status_master_server_id, whose value is the correct server_id:
mysql_slave_status_master_server_id{instance="batchdb001.mo-staging99-nonprod.dus1.cloud",job="prometheus-mysqld-exporter",master_host="dbmaster001",master_uuid="005e9c3d-baea-11ea-ab06-027e6d15fde3"} 2001500
So the alert definition would have to take the server_id into account:
  - alert: MySQLReplicationLag
    expr: (mysql_heartbeat_lag_seconds{server_id="2001500"} > 30) and ON(instance) ...
But how can I do this in my case, where server_id has to be compared with a metric's value (mysql_slave_status_master_server_id) rather than a fixed string?
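
One possible direction, given here as an untested sketch rather than a confirmed solution: PromQL's count_values aggregation can turn a sample value into a label, which would allow a join on (instance, server_id). The sketch assumes that the value of mysql_slave_status_master_server_id is rendered as a label string identical to the server_id label on the heartbeat series (for example "2001500"), and that both metrics share the same instance label.

```yaml
- alert: MySQLReplicationLag
  expr: |
    (
      (mysql_heartbeat_lag_seconds > 30)
      and on(instance, server_id)
      (predict_linear(mysql_heartbeat_lag_seconds[5m], 60 * 2) > 0)
    )
    # Keep only the series whose server_id label equals the current value of
    # mysql_slave_status_master_server_id. count_values copies that value into
    # a new "server_id" label; without() preserves the other labels.
    and on(instance, server_id)
    count_values without() ("server_id", mysql_slave_status_master_server_id)
```

Matching on server_id as well as instance in the inner "and" means the prediction is evaluated for the same series that exceeded the threshold, instead of any series on that instance.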