"Context deadline exceeded" for Target node_exporter


Meng Zhao

Oct 25, 2017, 2:53:52 AM
to Prometheus Users

  Hello everyone,

What did you do?

Tried to run a Prometheus server on an Ubuntu host and scrape metrics from node_exporter instances running on a few private-IP servers (see the environment section).

Used a basic config that scrapes these private-IP servers, via their private IP addresses, every 60 seconds.

Noticed that this very basic target was showing as DOWN in the Prometheus dashboard targets view, with the error: "context deadline exceeded"


What did you expect to see?

Both target groups reporting as UP and scrapes completing quickly



What did you see instead? Under which circumstances?

DOWN "context deadline exceeded"




Sometimes a target goes into the DOWN state, and a few hours later it comes back UP, without me doing anything.

It looks like the scrape request timed out, but even after setting scrape_timeout: 120s, the "context deadline exceeded" error still appears.



Environment

System information:
  • DISTRIB_CODENAME=xenial 
  • DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
  • Linux 4.4.0-47-generic x86_64

Prometheus version:

prometheus, version 1.7.1 (branch: master, revision: 3afb3fffa3a29c3de865e1172fb740442e9d0133)
build user: root@0aa1b7fc430d
build date: 20170612-11:44:05
 go version: go1.8.3

Prometheus configuration file:

global:
  scrape_interval:     30s
  evaluation_interval: 30s

  external_labels:
      monitor: 'Video-monitor'

rule_files:
  - "rules/alert.rules"

scrape_configs:
  - job_name: 'Video_prometheus'
    static_configs:
      - targets: ['192.168.1.13:9090']
        labels:
          alias: video-prometheus

  - job_name: 'Video_node_exporter'
    scrape_interval: 120s
    scrape_timeout: 120s
    static_configs:
      - targets: ['192.168.1.13:9100']
        labels:
          alias: video-prometheus
      - targets: ['192.168.1.2:9100']
        labels:
          alias: video-master
      - targets: ['192.168.1.3:9100']
        labels:
          alias: video-node01
      - targets: ['192.168.1.4:9100']
        labels:
          alias: video-node02
      - targets: ['192.168.1.5:9100']
        labels:
          alias: video-node03
      - targets: ['192.168.1.6:9100']
        labels:
          alias: video-node04
      - targets: ['192.168.1.7:9100']
        labels:
          alias: video-node05
      - targets: ['192.168.1.8:9100']
        labels:
          alias: video-node06

  • Logs:
    time="2017-10-25T12:39:05+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T12:39:24+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 19.442712172s." source="persistence.go:665"
    time="2017-10-25T12:40:32+08:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=51609 memoryChunks=401057 source="storage.go:1867" urgencyScore=1
    time="2017-10-25T12:40:32+08:00" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=51457 memoryChunks=385939 source="storage.go:1007" urgencyScore=1
    time="2017-10-25T12:40:48+08:00" level=info msg="Storage has left rushed mode." chunksToPersist=46416 memoryChunks=387607 source="storage.go:1857" urgencyScore=0.678
    time="2017-10-25T12:40:48+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T12:41:15+08:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=47004 memoryChunks=390279 source="storage.go:1867" urgencyScore=0.824
    time="2017-10-25T12:41:33+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 45.583003977s." source="persistence.go:665"
    time="2017-10-25T12:41:35+08:00" level=info msg="Storage does not need throttling anymore." chunksToPersist=47012 memoryChunks=390413 source="storage.go:1015" urgencyScore=0.824
    time="2017-10-25T12:41:50+08:00" level=info msg="Storage has left rushed mode." chunksToPersist=45292 memoryChunks=391119 source="storage.go:1857" urgencyScore=0.643
    time="2017-10-25T12:42:19+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T12:42:45+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 26.396031303s." source="persistence.go:665"
    time="2017-10-25T12:45:19+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T12:45:54+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 34.336368215s." source="persistence.go:665"
    time="2017-10-25T12:48:23+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T12:48:47+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 24.752371907s." source="persistence.go:665"
    time="2017-10-25T12:52:05+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T12:52:26+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 21.347779175s." source="persistence.go:665"
    time="2017-10-25T12:55:19+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T12:55:47+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 27.398922414s." source="persistence.go:665"
    time="2017-10-25T12:59:01+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T12:59:19+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 17.44142578s." source="persistence.go:665"
    time="2017-10-25T13:02:00+08:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=42452 memoryChunks=382769 source="storage.go:1867" urgencyScore=0.894
    time="2017-10-25T13:02:01+08:00" level=info msg="Storage has left rushed mode." chunksToPersist=41925 memoryChunks=382769 source="storage.go:1857" urgencyScore=0.685
    time="2017-10-25T13:02:01+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T13:02:30+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 29.087830198s." source="persistence.go:665"
    time="2017-10-25T13:02:37+08:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=41838 memoryChunks=383493 source="storage.go:1867" urgencyScore=0.844
    time="2017-10-25T13:02:40+08:00" level=info msg="Storage has left rushed mode." chunksToPersist=40930 memoryChunks=383563 source="storage.go:1857" urgencyScore=0.641
    time="2017-10-25T13:03:09+08:00" level=info msg="Completed maintenance sweep through 56789 archived fingerprints in 6h31m42.655936538s." source="storage.go:1442"
    time="2017-10-25T13:03:32+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
    time="2017-10-25T13:04:00+08:00" level=info msg="Done checkpointing in-memory metrics and chunks in 28.79803733s." source="persistence.go:665"
    time="2017-10-25T13:06:53+08:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"


Thank you very much  

Meng Zhao

Oct 25, 2017, 5:51:20 AM
to Prometheus Users



Over the last two days, this problem has appeared many times.


Ben Kochie

Oct 25, 2017, 9:59:26 AM
to Meng Zhao, Prometheus Users
A few questions:
* What version of node_exporter?
* Is this VM/cloud or bare metal?
* What flags are you passing to the node_exporter?



Meng Zhao

Oct 25, 2017, 10:42:26 PM
to Prometheus Users

hello,

My node_exporter version:
node_exporter, version 0.14.0 (branch: master, revision: 840ba5dcc71a084a3bc63cb6063003c1f94435a6)
  build user:       root@bb6d0678e7f3
  build date:       20170321-12:12:54
  go version:       go1.7.5



They are VM servers.

No flags, just `nohup ./node_exporter &`.




My Prometheus server's IP is 192.168.1.13; one of my node_exporter hosts is 192.168.1.10 (hostname: video-node08).


When many targets are in the DOWN state, I run `netstat -anpt | grep 9100` on the Prometheus server:


netstat -anpt |grep 9100

tcp        0      0 192.168.1.13:54006      192.168.1.23:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:55320      192.168.1.30:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:34320      192.168.1.26:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:59492      192.168.1.11:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:56348      192.168.1.4:9100        FIN_WAIT2   -               
tcp        0      0 192.168.1.13:59468      192.168.1.12:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:41684      192.168.1.31:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:34608      192.168.1.26:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:59756      192.168.1.12:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:43668      192.168.1.34:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:45682      192.168.1.32:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:60682      192.168.1.6:9100        FIN_WAIT2   -               
tcp        0      0 192.168.1.13:36934      192.168.1.27:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:60298      192.168.1.20:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:55608      192.168.1.30:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:37170      192.168.1.10:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:45394      192.168.1.32:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:41396      192.168.1.31:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:54294      192.168.1.23:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:42906      192.168.1.22:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:34140      192.168.1.5:9100        ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:58214      192.168.1.35:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:36440      192.168.1.8:9100        ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:60586      192.168.1.20:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:41794      192.168.1.21:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:58502      192.168.1.35:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:56636      192.168.1.4:9100        ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:59204      192.168.1.11:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:60970      192.168.1.6:9100        ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:51494      192.168.1.3:9100        FIN_WAIT2   -               
tcp        0      0 192.168.1.13:51994      192.168.1.24:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:55176      192.168.1.7:9100        ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:59764      192.168.1.29:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:44736      192.168.1.28:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:51782      192.168.1.3:9100        ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:46882      192.168.1.9:9100        ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:48168      192.168.1.25:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:42618      192.168.1.22:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:48366      192.168.1.33:9100       ESTABLISHED 73915/prometheus
tcp        0      0 192.168.1.13:41506      192.168.1.21:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:36646      192.168.1.27:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:51706      192.168.1.24:9100       FIN_WAIT2   -               
tcp6       0      0 :::9100                 :::*                    LISTEN      2018/node_exporter


For one of my DOWN node_exporters (IP: 192.168.1.10):

netstat -anpt |grep 9100 |grep 192.168.1.10

tcp        0      0 192.168.1.13:37458      192.168.1.10:9100       FIN_WAIT2   -               
tcp        0      0 192.168.1.13:37746      192.168.1.10:9100       ESTABLISHED 73915/prometheus 



I tried tcpdump to see what is really happening on the network, on the node_exporter host (IP: 192.168.1.10):

tcpdump -i ens224 |grep 192.168.1.13.37746
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens224, link-type EN10MB (Ethernet), capture size 262144 bytes
18:04:10.891665 IP 192.168.1.13.37746 > localhost.example.9100: Flags [F.], seq 1723464181, ack 3655520680, win 229, options [nop,nop,TS val 806110754 ecr 1021397731], length 0
18:04:10.931419 IP localhost.example.9100 > 192.168.1.13.37746: Flags [.], ack 1, win 235, options [nop,nop,TS val 1021427741 ecr 806110754], length 0
18:04:42.805920 IP localhost.example.9100 > 192.168.1.13.37746: Flags [P.], seq 1:4097, ack 1, win 235, options [nop,nop,TS val 1021435709 ecr 806110754], length 4096
18:04:42.805945 IP localhost.example.9100 > 192.168.1.13.37746: Flags [.], seq 4097:11337, ack 1, win 235, options [nop,nop,TS val 1021435709 ecr 806110754], length 7240
18:04:42.805951 IP localhost.example.9100 > 192.168.1.13.37746: Flags [.], seq 11337:14233, ack 1, win 235, options [nop,nop,TS val 1021435709 ecr 806110754], length 2896
18:04:42.806158 IP 192.168.1.13.37746 > localhost.example.9100: Flags [R], seq 1723464182, win 0, length 0
18:04:42.806177 IP 192.168.1.13.37746 > localhost.example.9100: Flags [R], seq 1723464182, win 0, length 0
18:04:42.806185 IP 192.168.1.13.37746 > localhost.example.9100: Flags [R], seq 1723464182, win 0, length 0
18:04:42.806188 IP 192.168.1.13.37746 > localhost.example.9100: Flags [R], seq 1723464182, win 0, length 0
18:04:42.806190 IP 192.168.1.13.37746 > localhost.example.9100: Flags [R], seq 1723464182, win 0, length 0
18:04:42.806192 IP 192.168.1.13.37746 > localhost.example.9100: Flags [R], seq 1723464182, win 0, length 0




But when everything is UP, running `netstat -anpt | grep 9100` on the Prometheus server shows:


netstat -anpt |grep 9100

tcp6       0      0 :::9100                 :::*                    LISTEN      2018/node_exporter
tcp6       0      0 192.168.1.13:9100       192.168.1.13:47460      TIME_WAIT   -       





 





Ben Kochie

Oct 26, 2017, 4:33:17 AM
to Meng Zhao, Prometheus Users
It sounds like one of the collectors is hanging. One way to debug this is to start up a node_exporter on a separate port, with debug logs enabled, and use curl from the Prometheus server to try to reproduce the problem.

Are you running NFS filesystems?  These can cause the exporter to hang sometimes.

With 0.14.0, you can try this command:

./node_exporter -log.level debug -web.listen-address ":9101"

Then, from the Prometheus server:

curl http://192.168.1.10:9101/metrics
This should give you some nice log output to see what is going on.

One suggestion, instead of using nohup to run the exporter, you should create a systemd unit file.  This will make it easier to use, and give you logs in syslog.
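A minimal unit file, as a sketch (the binary path matches the one used later in this thread; the dedicated user is an assumption, adjust for your install):

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus node_exporter
After=network.target

[Service]
# Assumption: a dedicated node_exporter user exists; drop this line to run as root
User=node_exporter
ExecStart=/opt/node_exporter-0.14.0.linux-amd64/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then `systemctl daemon-reload`, `systemctl enable node_exporter`, and `systemctl start node_exporter`; the exporter's output will end up in the journal/syslog.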



Meng Zhao

Oct 30, 2017, 12:07:06 AM
to Prometheus Users

hello,  

My filesystems are NFS. But why can NFS filesystems cause the exporter to hang sometimes, and how can I solve it?

I tried this command:
./node_exporter -log.level debug -web.listen-address ":9101"

Then I ran `curl http://192.168.1.10:9101/metrics` twice. The detailed output is below.
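(To time each request, something like `time curl -s -o /dev/null http://192.168.1.10:9101/metrics` can be used; the 3-second and 94-second figures quoted below line up with the filesystem collector durations in the debug logs.)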

The first request took about 3 seconds:


root@video-node08:/opt/node_exporter-0.14.0.linux-amd64# ./node_exporter -log.level debug -web.listen-address ":9101"


INFO[0000] Starting node_exporter (version=0.14.0, branch=master, revision=840ba5dcc71a084a3bc63cb6063003c1f94435a6)  source="node_exporter.go:140"
INFO[0000] Build context (go=go1.7.5, user=root@bb6d0678e7f3, date=20170321-12:12:54)  source="node_exporter.go:141"
INFO[0000] No directory specified, see --collector.textfile.directory  source="textfile.go:57"
INFO[0000] Enabled collectors:                           source="node_exporter.go:160"
INFO[0000]  - textfile                                   source="node_exporter.go:162"
INFO[0000]  - time                                       source="node_exporter.go:162"
INFO[0000]  - zfs                                        source="node_exporter.go:162"
INFO[0000]  - filesystem                                 source="node_exporter.go:162"
INFO[0000]  - hwmon                                      source="node_exporter.go:162"
INFO[0000]  - infiniband                                 source="node_exporter.go:162"
INFO[0000]  - sockstat                                   source="node_exporter.go:162"
INFO[0000]  - diskstats                                  source="node_exporter.go:162"
INFO[0000]  - edac                                       source="node_exporter.go:162"
INFO[0000]  - mdadm                                      source="node_exporter.go:162"
INFO[0000]  - meminfo                                    source="node_exporter.go:162"
INFO[0000]  - uname                                      source="node_exporter.go:162"
INFO[0000]  - wifi                                       source="node_exporter.go:162"
INFO[0000]  - entropy                                    source="node_exporter.go:162"
INFO[0000]  - filefd                                     source="node_exporter.go:162"
INFO[0000]  - loadavg                                    source="node_exporter.go:162"
INFO[0000]  - netstat                                    source="node_exporter.go:162"
INFO[0000]  - conntrack                                  source="node_exporter.go:162"
INFO[0000]  - netdev                                     source="node_exporter.go:162"
INFO[0000]  - stat                                       source="node_exporter.go:162"
INFO[0000]  - vmstat                                     source="node_exporter.go:162"
INFO[0000] Listening on :9101                            source="node_exporter.go:186"
DEBU[0005] OK: mdadm collector succeeded after 0.000052s.  source="node_exporter.go:98"
DEBU[0005] OK: sockstat collector succeeded after 0.000415s.  source="node_exporter.go:98"
DEBU[0005] Unable to detect InfiniBand devices           source="infiniband_linux.go:110"
DEBU[0005] OK: infiniband collector succeeded after 0.000092s.  source="node_exporter.go:98"
DEBU[0005] Return time: 1509334424.000000                source="time.go:47"
DEBU[0005] OK: time collector succeeded after 0.000026s.  source="node_exporter.go:98"
DEBU[0005] Cannot open "/proc/spl/kstat/zfs/arcstats" for reading. Is the kernel module loaded?  source="zfs_linux.go:32"
DEBU[0005] ZFS / ZFS statistics are not available        source="zfs.go:62"
DEBU[0005] OK: textfile collector succeeded after 0.000001s.  source="node_exporter.go:98"
DEBU[0005] OK: zfs collector succeeded after 0.000056s.  source="node_exporter.go:98"
DEBU[0005] OK: vmstat collector succeeded after 0.000347s.  source="node_exporter.go:98"
DEBU[0005] OK: netstat collector succeeded after 0.001169s.  source="node_exporter.go:98"
DEBU[0005] OK: wifi collector succeeded after 0.000100s.  source="node_exporter.go:98"
DEBU[0005] OK: uname collector succeeded after 0.001157s.  source="node_exporter.go:98"
DEBU[0005] Ignoring device: sda2                         source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: sda5                         source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: loop0                        source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: loop2                        source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: sda1                         source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: loop5                        source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: loop7                        source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: loop3                        source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: loop1                        source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: loop4                        source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: loop6                        source="diskstats_linux.go:175"
DEBU[0005] OK: diskstats collector succeeded after 0.000961s.  source="node_exporter.go:98"
DEBU[0005] OK: entropy collector succeeded after 0.000031s.  source="node_exporter.go:98"
DEBU[0005] OK: filefd collector succeeded after 0.000036s.  source="node_exporter.go:98"
DEBU[0005] OK: netdev collector succeeded after 0.002577s.  source="node_exporter.go:98"
DEBU[0005] return load 0: 8.060000                       source="loadavg.go:51"
DEBU[0005] return load 1: 5.540000                       source="loadavg.go:51"
DEBU[0005] return load 2: 4.250000                       source="loadavg.go:51"
DEBU[0005] OK: loadavg collector succeeded after 0.002898s.  source="node_exporter.go:98"


DEBU[0005] Set node_mem: map[string]float64{"PageTables":9.580544e+06, "WritebackTmp":0, "VmallocTotal":3.5184372087808e+13, "Buffers":4.12995584e+08, "Cached":1.2282290176e+10, "Inactive_file":1.1768369152e+10, "SwapTotal":2.143285248e+09, "Writeback":0, "VmallocUsed":0, "HugePages_Surp":0, "Hugepagesize":2.097152e+06, "HugePages_Free":0, "HugePages_Rsvd":0, "SwapFree":2.094804992e+09, "SReclaimable":3.0253056e+08, "KernelStack":1.06496e+07, "NFS_Unstable":1.7522688e+07, "CmaTotal":0, "Active":1.733500928e+09, "Inactive_anon":1.73965312e+08, "Dirty":7.098368e+06, "Shmem":1.272832e+08, "HugePages_Total":0, "HardwareCorrupted":0, "AnonHugePages":2.81018368e+08, "CmaFree":0, "MemFree":2.412945408e+09, "Inactive":1.1942334464e+10, "Active_file":7.97052928e+08, "Bounce":0, "VmallocChunk":0, "DirectMap2M":1.0542383104e+10, "Mlocked":3.74784e+06, "AnonPages":9.84121344e+08, "Mapped":1.67022592e+08, "SUnreclaim":2.27487744e+08, "CommitLimit":1.0547777536e+10, "Active_anon":9.36448e+08, "Slab":5.30018304e+08, "Committed_AS":3.882319872e+09, "MemTotal":1.6808988672e+10, "DirectMap1G":8.589934592e+09, "MemAvailable":1.4934536192e+10, "SwapCached":290816, "Unevictable":3.74784e+06, "DirectMap4k":1.94838528e+08}  source="meminfo.go:48"


DEBU[0005] OK: meminfo collector succeeded after 0.003253s.  source="node_exporter.go:98"
DEBU[0005] OK: edac collector succeeded after 0.000015s.  source="node_exporter.go:98"
DEBU[0005] OK: conntrack collector succeeded after 0.003276s.  source="node_exporter.go:98"
DEBU[0005] OK: stat collector succeeded after 0.001069s.  source="node_exporter.go:98"
DEBU[0005] OK: hwmon collector succeeded after 0.008732s.  source="node_exporter.go:98"
DEBU[0005] Ignoring mount point: /sys                    source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /proc                   source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev                    source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev/pts                source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/kernel/security    source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev/shm                source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup          source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/systemd  source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/pstore          source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/net_cls,net_prio  source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/hugetlb  source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/blkio    source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/devices  source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/perf_event  source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/freezer  source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/cpu,cpuacct  source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/memory   source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/cpuset   source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/pids     source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /proc/sys/fs/binfmt_misc  source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev/mqueue             source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev/hugepages          source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/kernel/debug       source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/fuse/connections  source="filesystem_linux.go:42"
DEBU[0008] OK: filesystem collector succeeded after 3.365349s.  source="node_exporter.go:98"






The second request took about 94 seconds:


DEBU[0602] OK: mdadm collector succeeded after 0.000281s.  source="node_exporter.go:98"
DEBU[0602] OK: textfile collector succeeded after 0.000001s.  source="node_exporter.go:98"
DEBU[0602] return load 0: 4.750000                       source="loadavg.go:51"
DEBU[0602] return load 1: 7.070000                       source="loadavg.go:51"
DEBU[0602] return load 2: 6.270000                       source="loadavg.go:51"
DEBU[0602] Unable to detect InfiniBand devices           source="infiniband_linux.go:110"
DEBU[0602] OK: loadavg collector succeeded after 0.000275s.  source="node_exporter.go:98"
DEBU[0602] OK: vmstat collector succeeded after 0.000520s.  source="node_exporter.go:98"
DEBU[0602] OK: infiniband collector succeeded after 0.000136s.  source="node_exporter.go:98"
DEBU[0602] Return time: 1509335020.000000                source="time.go:47"
DEBU[0602] OK: time collector succeeded after 0.000050s.  source="node_exporter.go:98"
DEBU[0602] OK: conntrack collector succeeded after 0.000054s.  source="node_exporter.go:98"
DEBU[0602] OK: sockstat collector succeeded after 0.000225s.  source="node_exporter.go:98"
DEBU[0602] OK: edac collector succeeded after 0.000028s.  source="node_exporter.go:98"
DEBU[0602] OK: uname collector succeeded after 0.000027s.  source="node_exporter.go:98"
DEBU[0602] Set node_mem: map[string]float64{"Mlocked":3.74784e+06, "Committed_AS":2.74610176e+09, "VmallocChunk":0, "HardwareCorrupted":0, "HugePages_Free":0, "Hugepagesize":2.097152e+06, "DirectMap4k":1.94838528e+08, "Cached":1.348175872e+10, "Active_file":1.000648704e+09, "Shmem":1.27266816e+08, "AnonHugePages":1.71966464e+08, "CmaFree":0, "MemTotal":1.6808988672e+10, "MemAvailable":1.5493750784e+10, "Buffers":4.13782016e+08, "Inactive":1.2938891264e+10, "SwapTotal":2.143285248e+09, "Dirty":57344, "Writeback":1.4680064e+07, "HugePages_Surp":0, "DirectMap1G":8.589934592e+09, "SwapCached":290816, "Inactive_file":1.2765110272e+10, "Unevictable":3.74784e+06, "NFS_Unstable":1.9922944e+07, "WritebackTmp":0, "CmaTotal":0, "HugePages_Rsvd":0, "Active":1.38620928e+09, "SUnreclaim":2.2284288e+08, "VmallocTotal":3.5184372087808e+13, "Active_anon":3.85560576e+08, "Mapped":1.57380608e+08, "Slab":5.23497472e+08, "SReclaimable":3.00654592e+08, "KernelStack":9.568256e+06, "Bounce":0, "AnonPages":4.3307008e+08, "CommitLimit":1.0547777536e+10, "VmallocUsed":0, "HugePages_Total":0, "MemFree":1.773699072e+09, "Inactive_anon":1.73780992e+08, "SwapFree":2.094804992e+09, "PageTables":7.319552e+06, "DirectMap2M":1.0542383104e+10}  source="meminfo.go:48"
DEBU[0602] OK: wifi collector succeeded after 0.000089s.  source="node_exporter.go:98"
DEBU[0602] OK: netdev collector succeeded after 0.000431s.  source="node_exporter.go:98"
DEBU[0602] OK: meminfo collector succeeded after 0.000685s.  source="node_exporter.go:98"
DEBU[0602] OK: entropy collector succeeded after 0.000047s.  source="node_exporter.go:98"
DEBU[0602] OK: filefd collector succeeded after 0.000059s.  source="node_exporter.go:98"
DEBU[0602] OK: stat collector succeeded after 0.000352s.  source="node_exporter.go:98"
DEBU[0602] Cannot open "/proc/spl/kstat/zfs/fm" for reading. Is the kernel module loaded?  source="zfs_linux.go:32"
DEBU[0602] ZFS / ZFS statistics are not available        source="zfs.go:62"
DEBU[0602] OK: zfs collector succeeded after 0.000421s.  source="node_exporter.go:98"
DEBU[0602] OK: netstat collector succeeded after 0.001010s.  source="node_exporter.go:98"
DEBU[0602] Ignoring device: loop2                        source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: sda1                         source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: loop7                        source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: sda2                         source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: loop1                        source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: loop4                        source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: loop6                        source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: loop0                        source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: loop3                        source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: loop5                        source="diskstats_linux.go:175"
DEBU[0602] Ignoring device: sda5                         source="diskstats_linux.go:175"
DEBU[0602] OK: diskstats collector succeeded after 0.003107s.  source="node_exporter.go:98"
DEBU[0602] OK: hwmon collector succeeded after 0.004411s.  source="node_exporter.go:98"
DEBU[0602] Ignoring mount point: /sys                    source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /proc                   source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /dev                    source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /dev/pts                source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/kernel/security    source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /dev/shm                source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup          source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/systemd  source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/pstore          source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/net_cls,net_prio  source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/hugetlb  source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/blkio    source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/devices  source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/perf_event  source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/freezer  source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/cpu,cpuacct  source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/memory   source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/cpuset   source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/cgroup/pids     source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /proc/sys/fs/binfmt_misc  source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /dev/mqueue             source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /dev/hugepages          source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/kernel/debug       source="filesystem_linux.go:42"
DEBU[0602] Ignoring mount point: /sys/fs/fuse/connections  source="filesystem_linux.go:42"
DEBU[0696] Statfs on /var/lib/kubelet/pods/8edacf3f-bd22-11e7-b192-005056b10f29/volumes/kubernetes.io~secret/default-token-dhk6c returned no such file or directory  source="filesystem_linux.go:55"
DEBU[0696] OK: filesystem collector succeeded after 94.682083s.  source="node_exporter.go:98"


This happened again today.
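For what it's worth, the slow run above is dominated by the filesystem collector (94.68 s), which points at a mount that blocks on statfs. A quick way to check this from the shell on the node is sketched below (the 5-second timeout is arbitrary, and the loop assumes mount points without spaces):

for m in $(awk '$3 ~ /^nfs/ {print $2}' /proc/mounts); do
  # stat -f issues the same statfs() call the collector relies on; if it does
  # not return within 5 seconds, that mount is the likely culprit
  timeout 5 stat -f "$m" > /dev/null || echo "possibly hung: $m"
done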

Meng Zhao

Oct 31, 2017, 4:52:10 AM
to Prometheus Users


Can you tell me why NFS filesystems could cause the exporter to hang? Should I change the filesystem?

Every time the exporter hangs, I receive a lot of emails. This is our production environment, so I want to solve this problem as soon as possible.

Thank you for your attention to this matter.
    
      


Ben Kochie

Oct 31, 2017, 8:34:19 AM
to Meng Zhao, Prometheus Users
NFS can cause blocking IO, depending on how you have your mounts configured.

I would update the node_exporter flags to ignore NFS filesystem mounts; this will prevent the hangs. You can monitor NFS filesystem space on the NFS server directly.

-collector.filesystem.ignored-fs-types '^(nfs.*|(sys|proc|auto)fs)$'
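For example, with the 0.14.0 binary used earlier in the thread, the invocation would look something like this (a sketch; the flag value is the one suggested above):

./node_exporter -collector.filesystem.ignored-fs-types '^(nfs.*|(sys|proc|auto)fs)$'

The same flag can be added to the ExecStart line of a systemd unit; after restarting the exporter, the filesystem collector should skip NFS mounts entirely.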
