Node exporter with TLS enabled logs msg="collector failed"


Jack Chew

Mar 19, 2020, 7:04:56 AM
to Prometheus Users
Hi team,

When I use a node_exporter web-config.yml for the TLS settings, the error shown in the screenshots appears. I tried a different node_exporter server and hit the same problem, but if I remove the web config there is no problem.


1png.png
2.png
prometheus.png

Ben Kochie

Mar 19, 2020, 9:25:19 AM
to Jack Chew, Prometheus Users
The schedstat error is very strange; that file has existed in the kernel since 2.6.20, which is from 2007. What Linux kernel version are you using?


--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/c9942705-3267-4fde-b5de-536c8018abfd%40googlegroups.com.

Ben Kochie

Mar 19, 2020, 9:32:42 AM
to Jack Chew, Prometheus Users
https://github.com/prometheus/node_exporter/pull/1641 will take care of the schedstat log noise. But it's very odd that it would be missing on any modern Linux system.
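A quick way to confirm whether the file really is absent on a given host is a plain readability check. This is only a sketch: `check_proc_files` is a made-up helper name, and it assumes `/proc/schedstat` is the file in question.

```shell
# check_proc_files is a hypothetical helper, not part of node_exporter.
# For each path given, report whether it is readable by the current user.
check_proc_files() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "ok: $f"
        else
            echo "missing: $f"
        fi
    done
}

# Example:
# check_proc_files /proc/schedstat /proc/net/softnet_stat
```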

Brian Candler

Mar 19, 2020, 1:22:06 PM
to Prometheus Users
On Thursday, 19 March 2020 11:04:56 UTC, Jack Chew wrote:
When I use a node_exporter web-config.yml for the TLS settings, the error shown in the screenshots appears. I tried a different node_exporter server and hit the same problem, but if I remove the web config there is no problem.

node_exporter 1.0.0 rc1?  TLS with client cert authentication works for me.

But if you're doing full TLS with client certs, you need authentication in both directions:

- The server (node_exporter) needs a certificate signed by a CA
- The name in the certificate (CN or SAN) needs to match either the hostname that prometheus is connecting to, or the "server_name" setting in tls_config if that is specified
- The client (prometheus) needs a certificate signed by a CA [not necessarily the same one]
- The server (node_exporter) doesn't care about the identity in the client certificate, but it does need the certificate of the CA which signed prometheus' cert.

Here's how I make this work with two keys and certs: one for prometheus, and one shared by all the node_exporters.

I am going to assume you do the following on the prometheus server, and node_exporter is also running on this node (reachable as 127.0.0.1:9100), and show how to build it up in stages.

1. create a key and certificate for node_exporter to use:

mkdir /etc/prometheus/ssl
cd /etc/prometheus/ssl
openssl req -x509 -newkey rsa:2048 -keyout prom_node_key.pem -out prom_node_cert.pem -days 29220 -nodes -subj /commonName=prom_node/

Type `ls` and you should see two files: `prom_node_cert.pem` and `prom_node_key.pem`.  This is how the node_exporter identifies itself to prometheus.
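To double-check what was generated, something like the following prints the certificate's subject, issuer and validity dates. It is a sketch that assumes the `openssl` command-line tool is on the PATH; `show_cert` is a made-up helper name.

```shell
# show_cert is a hypothetical helper, not part of node_exporter or prometheus.
# Prints the subject, issuer and validity dates of a PEM certificate.
show_cert() {
    openssl x509 -in "$1" -noout -subject -issuer -dates
}

# Example (path taken from step 1):
# show_cert /etc/prometheus/ssl/prom_node_cert.pem
```

Because the certificate is self-signed, subject and issuer should be identical, and the subject should contain the name `prom_node`.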

2. create a file `/etc/prometheus/node_tls.yml` with the following contents:

tlsConfig:
  tlsCertPath: /etc/prometheus/ssl/prom_node_cert.pem
  tlsKeyPath: /etc/prometheus/ssl/prom_node_key.pem

3. Change your node_exporter config to add

--web.config=/etc/prometheus/node_tls.yml

to the command-line options it runs with (e.g. edit your systemd unit file, or /etc/default/node_exporter, or whatever).  Restart it and check for errors.

4. Now we can do a test scrape using curl and https:

curl --cacert /etc/prometheus/ssl/prom_node_cert.pem --resolve prom_node:9100:127.0.0.1 -v https://prom_node:9100/metrics

The scrape should be successful.  We've done it over https.  We've used the fake hostname "prom_node" to match the certificate, and told curl to use address 127.0.0.1 for this hostname, and to verify the certificate in prom_node_cert.pem.

If it doesn't work at this point, fix the problem before proceeding.

However, anyone can still scrape.  So now we need to make a new key and cert for the prometheus server to use when scraping, and configure node_exporter so that it only accepts scrapes from someone with this key.

5. Create the new key and cert for prometheus:

cd /etc/prometheus/ssl
openssl req -x509 -newkey rsa:2048 -keyout prometheus_key.pem -out prometheus_cert.pem -days 29220 -nodes -subj /commonName=prometheus/

6. Edit `/etc/prometheus/node_tls.yml` so it looks like this:

tlsConfig:
  tlsCertPath: /etc/prometheus/ssl/prom_node_cert.pem
  tlsKeyPath: /etc/prometheus/ssl/prom_node_key.pem

  clientAuth: RequireAndVerifyClientCert
  clientCAs: /etc/prometheus/ssl/prometheus_cert.pem

Restart node_exporter.

7. Now re-run the *exact* same curl command as you did before:

curl --cacert /etc/prometheus/ssl/prom_node_cert.pem --resolve prom_node:9100:127.0.0.1 -v https://prom_node:9100/metrics

This time you should see an error:

curl: (35) gnutls_handshake() failed: Certificate is bad

This is because the client isn't presenting a certificate to the server to identify itself.

We now need to give a longer curl line (split for clarity):

curl --cert /etc/prometheus/ssl/prometheus_cert.pem \
     --key /etc/prometheus/ssl/prometheus_key.pem \
     --cacert /etc/prometheus/ssl/prom_node_cert.pem \
     --resolve prom_node:9100:127.0.0.1 \
     -v https://prom_node:9100/metrics

This should now work.  We've proved our identity to node_exporter using the prometheus private key, and node_exporter will now talk to us.

8. Now you just need to change the prometheus config to scrape using tls.

Edit your prometheus.yml and find the section which scrapes node_exporter.  Edit it so that it includes scheme: https and a tls_config section as below.

  - job_name: 'node'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.d/node.yml
    scheme: https
    tls_config:
      # Verifying remote identity
      ca_file: /etc/prometheus/ssl/prom_node_cert.pem
      server_name: prom_node
      # Asserting our identity
      cert_file: /etc/prometheus/ssl/prometheus_cert.pem
      key_file: /etc/prometheus/ssl/prometheus_key.pem
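The job above reads its targets from `/etc/prometheus/targets.d/node.yml`, whose contents are not shown. As a minimal sketch, assuming a single node_exporter on 127.0.0.1:9100 as in the earlier steps (the helper name `write_node_targets` and the target address are illustrative only), the file could be created like this:

```shell
# write_node_targets is a hypothetical helper; the target below is an assumed
# example matching the local node_exporter used throughout this walkthrough.
write_node_targets() {
    # $1: path of the file_sd targets file to create
    cat > "$1" <<'EOF'
- targets:
    - 127.0.0.1:9100
EOF
}

# Example:
# write_node_targets /etc/prometheus/targets.d/node.yml
```

Since `server_name: prom_node` is set in tls_config, targets can be listed by IP address or real hostname; the certificate is checked against `prom_node` either way.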


Signal prometheus to re-read its configuration, and check for errors:

killall -HUP prometheus
journalctl -eu prometheus   # e.g. if you are running prometheus under systemd

9. Deployment to other nodes

To deploy this to remote nodes with node_exporter, you would copy the following files to them:

* `/etc/default/node_exporter` (or however you set the command line options on node_exporter)
* `/etc/prometheus/node_tls.yml`
* `/etc/prometheus/ssl/prom_node_cert.pem`
* `/etc/prometheus/ssl/prom_node_key.pem`
* `/etc/prometheus/ssl/prometheus_cert.pem`

but NOT `prometheus_key.pem`.  That file is private to the prometheus server only; it is possession of this key that proves the prometheus server's identity.
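The copy step could be scripted roughly as follows. This is a sketch under several assumptions: root SSH access to each node, the same paths on the remote side, and `/etc/prometheus/ssl` already existing there. `deploy_node_tls` and the `DRY_RUN` variable are made-up names.

```shell
# deploy_node_tls is a hypothetical helper; host names passed to it are examples.
# With DRY_RUN=1 it only prints the scp commands instead of running them.
# prometheus_key.pem is deliberately absent from the file list.
deploy_node_tls() {
    for host in "$@"; do
        for f in /etc/default/node_exporter \
                 /etc/prometheus/node_tls.yml \
                 /etc/prometheus/ssl/prom_node_cert.pem \
                 /etc/prometheus/ssl/prom_node_key.pem \
                 /etc/prometheus/ssl/prometheus_cert.pem; do
            if [ "${DRY_RUN:-0}" = "1" ]; then
                echo "scp -p ${f} root@${host}:${f}"
            else
                scp -p "${f}" "root@${host}:${f}"
            fi
        done
    done
}

# Example: preview what would be copied to two (made-up) hosts.
# DRY_RUN=1 deploy_node_tls node1.example.com node2.example.com
```

Remember to restart node_exporter on each node afterwards so it picks up the new options.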

Jack Chew

Mar 23, 2020, 1:42:03 AM
to Prometheus Users
Hi Ben Kochie,

My kernel is 3.10.0-693.el7.x86_64. 


Jack Chew

Mar 23, 2020, 1:45:06 AM
to Prometheus Users
Hi Brian Candler,

Thank you for the guide. I tried it and TLS works, but the problem remains. My node_exporter is 1.0.0 rc1. The error node_exporter now logs is:    level=error ts=2020-03-23T05:27:56.168Z caller=collector.go:161 msg="collector failed" name=softnet duration_seconds=3.915e-05 err="could not get softnet statistics: failed to parse /proc/net/softnet_stat: 10 columns were detected, but 11 were expected"

png123.png

Brian Candler

Mar 23, 2020, 4:59:57 AM
to Prometheus Users
Sounds like a problem unrelated to TLS.

It also sounds like you are running some very strange or ancient kernel, for the same reason that you had problems with schedstat.

You said 3.10.0-693.el7.x86_64.  If that's the standard CentOS 7 kernel then you could take this issue up on github for node_exporter.

Brian Candler

Mar 23, 2020, 6:12:09 AM
to Prometheus Users
Changing from 10 to 11 values in /proc/net/softnet_stat was done in May 2013 and went into kernels 3.11+

Given that RHEL/CentOS 7 is still maintained, I think it's reasonable to open a ticket for node_exporter to make parsing of the 11th value optional.
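To see how many columns a particular kernel exposes, the fields on each line can simply be counted. A small sketch, with `softnet_columns` as a made-up helper name:

```shell
# softnet_columns is a hypothetical helper name.
# Prints the distinct per-line column counts found in a softnet_stat file
# (defaults to /proc/net/softnet_stat, which has one line per CPU).
softnet_columns() {
    awk '{ print NF }' "${1:-/proc/net/softnet_stat}" | sort -un
}

# Example:
# softnet_columns
```

Going by the error message quoted earlier in the thread, the 3.10.0-693.el7 host would print 10 here, while the parser expects 11.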

Jack Chew

Mar 23, 2020, 7:29:23 AM
to Prometheus Users
Thanks Brian Candler,

I tried CentOS servers from three different service providers, and the same issue arises on all of them.

uname -a
1.pccw 
CentOS Linux release 7.7.1908 (Core)
Linux mybigjj 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

2.azure
CentOS release 6.10 (Final)
Linux testing 2.6.32-754.27.1.el6.x86_64 #1 SMP Tue Jan 28 14:11:45 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

3.ovh
CentOS Linux release 7.7.1908 (Core)
Linux ovhtest 4.19-ovh-xxxx-std-ipv6-64 #1125149 SMP Fri Feb 21 08:31:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

So, should I open an issue on the node_exporter GitHub?

microsolf.png
ovh.png
png123.png

Brian Candler

Mar 23, 2020, 8:33:10 AM
to Prometheus Users
The ovh one has a newer kernel but is still giving the schedstat error; you'll need to build it yourself from git so that you have the patch from https://github.com/prometheus/node_exporter/pull/1641

Or wait for 1.0.0 final or the next rc.

So, should I open an issue on the node_exporter GitHub?


Yes, specifically for the softnet_stat problem.

Jack Chew

Mar 23, 2020, 8:39:40 AM
to Prometheus Users

OK, thanks Brian Candler.