Hi all,
We've found an issue where replication metrics related to the remotes (replication_delay, replication_latency, replication_retries) disappear from Prometheus after running `gerrit plugin reload replication`.
The Issue:
When reloading the replication plugin, metrics frequently disappear once the command completes. If they were already missing before the reload, they briefly reappear right after the reload starts, then disappear again when the command completes. Running `replication start --all` afterwards does not restore them, even though the logs show that replication is clearly still happening.
This happens on every reload on our large Gerrit instance, while our smaller Gerrit instance hits it less often, which suggests a race condition.
Root Cause:
The plugin uses the plugin-scoped MetricMaker (actually PluginMetricMaker), which tracks all the metrics it creates in a cleanup set. We believe that during reload, when the old plugin's stop() method runs, it removes every metric in this cleanup set by name. Since the new plugin's metrics have the same names, the new plugin's entries are the ones that get removed from the descriptions map of Gerrit's metric registry, making them invisible to Prometheus.
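To illustrate the ordering problem we suspect, here is a small standalone toy model. It is not Gerrit code: `GlobalRegistry`, `ScopedMetricMaker` and `visibleAfterReload` are made-up names, and the model only assumes that cleanup removes entries by name without checking which plugin instance registered them.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model only; all class and method names here are hypothetical, not Gerrit's API. */
class ReloadRaceSketch {

  /** Stands in for the server-wide map that the Prometheus exporter would read. */
  static final class GlobalRegistry {
    final Map<String, String> descriptions = new ConcurrentHashMap<>();
  }

  /** Stands in for a plugin-scoped metric maker that remembers what it created. */
  static final class ScopedMetricMaker {
    private final GlobalRegistry registry;
    private final Set<String> cleanup = new HashSet<>();

    ScopedMetricMaker(GlobalRegistry registry) {
      this.registry = registry;
    }

    void register(String name, String description) {
      registry.descriptions.put(name, description);
      cleanup.add(name);
    }

    /** Removes by name only, so it cannot tell whose entry it is deleting. */
    void stop() {
      for (String name : cleanup) {
        registry.descriptions.remove(name);
      }
      cleanup.clear();
    }
  }

  /** Returns true if the metric is still visible after a simulated reload. */
  static boolean visibleAfterReload(boolean oldStopsFirst) {
    GlobalRegistry registry = new GlobalRegistry();
    String name = "replication_delay"; // illustrative name taken from the report above

    ScopedMetricMaker oldPlugin = new ScopedMetricMaker(registry);
    oldPlugin.register(name, "old instance");
    ScopedMetricMaker newPlugin = new ScopedMetricMaker(registry);

    if (oldStopsFirst) {
      oldPlugin.stop();
      newPlugin.register(name, "new instance");
    } else {
      // Suspected bad ordering: the new instance registers, then the old one cleans up.
      newPlugin.register(name, "new instance");
      oldPlugin.stop();
    }
    return registry.descriptions.containsKey(name);
  }

  public static void main(String[] args) {
    System.out.println("old stops first: " + visibleAfterReload(true));
    System.out.println("new registers first: " + visibleAfterReload(false));
  }
}
```

In this model the metric survives only when the old instance finishes cleaning up before the new one registers. That would match the brief reappearance followed by disappearance we see at the end of the reload.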
Proposed fix:
https://gerrit-review.googlesource.com/c/plugins/replication/+/531381
I've tested this on Gerrit 3.12 and metrics now persist correctly after plugin reload, without fail.
Best regards,
Ewout Maertens