Hi there,
Last week we hit a few connectivity issues in one of our production environments, which caused brief windows during which Prometheus was unable to scrape metrics from its targets while our SDN controller rebooted.
The thing is, after one of these hiatuses I noticed that one of our physical hosts started reporting much less total memory (i.e. node_exporter's node_memory_MemTotal_bytes, which dropped from ~500GiB to 3.85GiB) on one of our Grafana dashboards that tracks memory pressure per host:
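The panel behind that graph boils down to a query along these lines (the metric name and job label come from our setup; the per-GiB division and the instance filter are illustrative, not the exact dashboard expression):

# Total memory per host, in GiB
node_memory_MemTotal_bytes{job="node_exporter"} / 1024^3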
To explain this behavior, I went all the way down the rabbit hole. Since we use file-based service discovery for targets, I started at the source, looking for server #4's target definition:
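For reference, file-based service discovery is wired into the Prometheus scrape config roughly like this — a sketch only: the job name and the targets file path match what you'll see below, but refresh_interval is an assumption (Prometheus also picks up file changes via inotify, so the interval is just a fallback re-read):

# prometheus.yml (excerpt, sketch)
scrape_configs:
  - job_name: node_exporter
    file_sd_configs:
      - files:
          - /etc/prometheus/tgroups/targets.json
        refresh_interval: 5m   # assumed value; inotify handles live updates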
cat /etc/prometheus/tgroups/targets.json | python -m json.tool
[
    {
        "labels": {
            "env": "pro",
            "job": "node_exporter"
        },
        "targets": [
            "pro-oln-prometheus:9100",
            "n001:9100",
            "n002:9100",
            "n003:9100",
            "n004:9100",
            "n005:9100",
Then I tried querying that host's node_exporter endpoint directly from our Prometheus instance via cURL w/
