How to debug a mismatched target at scrape time on a running Prometheus


Samuel Alfageme

Jul 30, 2019, 1:29:53 PM
to Prometheus Users

Hi there,


Last week, we hit a few connectivity issues in one of our production environments that caused a few periods during which Prometheus was unable to scrape metrics from its targets while our SDN controller rebooted.


The thing is, after one of these hiatuses I noticed that one of our physical hosts started reporting much less total memory (via node_exporter's node_memory_MemTotal_bytes, which dropped from ~500 GiB to 3.85 GiB) on one of our Grafana dashboards that tracks memory pressure per host:


memory_usage.png


To explain this behavior, I went all the way down the rabbit hole. Since we use file-based service discovery for targets, I went to the source, looking for server #4's target definition:


cat /etc/prometheus/tgroups/targets.json | python -m json.tool
[
    {
        "labels": {
            "env": "pro",
            "job": "node_exporter"
        },
        "targets": [
            "pro-oln-prometheus:9100",
            "n001:9100",
            "n002:9100",
            "n003:9100",
            "n004:9100",
            "n005:9100",


Then I tried querying that host's node_exporter endpoint directly from our Prometheus instance via cURL:


curl -s http://n004:9100/metrics | grep node_memory_MemTotal_bytes
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 5.40937314304e+11

... which actually looks like the real value, not the one reported by Prometheus. Our internal DNS resolves the n004 hostname to the right address, and a tcpdump inside the host on port 9100, filtering on the Prometheus IP as source, revealed no traffic between the two, even though it did for the rest of our servers. At that point I thought of using the node_uname_info metric to peek at the value of nodename, and that gave me the answer:

node_uname_info{domainname="(none)",env="pro",instance="n004:9100",job="node_exporter",machine="x86_64",nodename="pro-oln-prometheus",release="4.15.0-45-generic",sysname="Linux",version="#48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019"} 

There you go; somehow Prometheus mixed up this target with itself at some point. Now I guess sending a SIGHUP to the process will make Prometheus re-discover all the targets and fix this issue. However, I'm curious about ways to debug this while it's happening, to find an explanation and possible fixes - you know, for science.
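For anyone retracing this, the comparison can be scripted as a side-by-side check: ask the Prometheus HTTP query API for the stored n004 series, then scrape the exporter directly. A rough sketch - the pro-oln-prometheus:9090 address is an assumption based on the targets file above; adjust to your setup:

```shell
# What Prometheus last stored for n004's MemTotal (the suspect series).
curl -s 'http://pro-oln-prometheus:9090/api/v1/query' \
  --data-urlencode 'query=node_memory_MemTotal_bytes{instance="n004:9100"}'

# What the exporter itself reports right now, for comparison.
# The trailing space in the pattern skips the # HELP / # TYPE lines.
curl -s http://n004:9100/metrics | grep '^node_memory_MemTotal_bytes '
```

If the two numbers disagree, the stored series came from somewhere other than n004's exporter.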

Best,
Samuel Alfageme

Samuel Alfageme

Jul 30, 2019, 1:32:18 PM
to Prometheus Users
Somehow the screenshot got corrupted when I posted the message. Here it is again:

memory_usage.png

Simon Pasquier

Jul 31, 2019, 3:26:31 AM
to Samuel Alfageme, Prometheus Users
I suspect that something got mixed up during the initial name resolution.
Sending a SIGHUP may not be sufficient: if the targets (and their
associated labels) don't change, Prometheus won't recreate them. If that
is the case, you can remove "n004:9100" from the target list, SIGHUP
Prometheus, add "n004:9100" back, and finally SIGHUP again.
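Sketched out, that workaround could look like the following. The file path and target name come from earlier in the thread; editing the JSON with inline Python and reloading via SIGHUP are assumptions about how the box is set up:

```shell
TARGETS=/etc/prometheus/tgroups/targets.json

# 1. Drop the stale target and reload, so Prometheus destroys it entirely.
python - "$TARGETS" <<'EOF'
import json, sys
with open(sys.argv[1]) as f:
    groups = json.load(f)
for group in groups:
    group["targets"] = [t for t in group["targets"] if t != "n004:9100"]
with open(sys.argv[1], "w") as f:
    json.dump(groups, f, indent=4)
EOF
kill -HUP "$(pidof prometheus)"

# 2. Add it back and reload again; the target is recreated from scratch.
python - "$TARGETS" <<'EOF'
import json, sys
with open(sys.argv[1]) as f:
    groups = json.load(f)
groups[0]["targets"].append("n004:9100")
with open(sys.argv[1], "w") as f:
    json.dump(groups, f, indent=4)
EOF
kill -HUP "$(pidof prometheus)"
```

(In practice file_sd also watches the file for changes, so the edits alone may be picked up; the explicit SIGHUPs just follow the suggestion above.)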



Samuel Alfageme

Aug 5, 2019, 3:37:28 AM
to Simon Pasquier, Prometheus Users
You’re absolutely right - SIGHUP was not enough to re-discover the target with its correct address.

Since I only noticed this by pure luck (I happened to be tracking down these specific hosts and spotted the anomaly) - could we think of an automatic recovery mechanism that would allow this kind of name-resolution mismatch to be fixed gracefully (i.e. without any manual steps) every once in a while?
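For detection at least, here is a PromQL sketch (not from the thread, just an idea): compare the nodename each exporter reports against the host part of the instance label it was scraped under, and alert on anything this returns. It assumes targets are named so that the instance host should equal the machine's nodename, as in the targets file above:

```
node_uname_info
  unless
label_replace(node_uname_info, "nodename", "$1", "instance", "([^:]+):.*")
```

label_replace rewrites each series' nodename to the host part of its instance; unless, which matches on the full label set, then keeps only the series where the two disagree. Against the data above, this would have returned exactly the n004 series with nodename="pro-oln-prometheus".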

Best, 
Samuel