Hi,
This might be a systemd problem, but I would like to know whether anyone has experienced something similar and whether there is a recommended workaround.
I am running a test setup with node_exporter and this service configuration:
-- 8< --
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
[Service]
User=root
Restart=on-failure
WorkingDirectory=/var/local/prometheus/node_exporter
ExecStart=/usr/local/bin/prometheus_exporter_pack/node_exporter/node_exporter -web.listen-address 127.0.0.1:9100 \
-collector.filesystem.ignored-fs-types "^(sys|proc|auto)fs$" \
-collector.filesystem.ignored-mount-points "^/(sys|proc|dev)($|/)" \
-log.level warn
[Install]
WantedBy=default.target
-- >8 --
Last night the service died on one server with:
-- 8< --
node_exporter.service - Prometheus Node Exporter
Loaded: loaded (/usr/local/bin/prometheus_exporter_pack/node_exporter/init/node_exporter.service; enabled)
Active: inactive (dead) since Mo 2017-06-26 23:22:19 CEST; 10h ago
Process: 20079 ExecStart=/usr/local/bin/prometheus_exporter_pack/node_exporter/node_exporter -web.listen-address 127.0.0.1:9100 -collector.filesystem.ignored-fs-types ^(sys|proc|auto)fs$ -collector.filesystem.ignored-mount-points ^/(sys|proc|dev)($|/) -log.level warn (code=killed, signal=PIPE)
Main PID: 20079 (code=killed, signal=PIPE)
-- >8 --
The service was not restarted and an alert fired, although there was (and is) no problem with the server itself.
For several hours before the crash, this node_exporter had been logging problems with diskstats (node_exporter[20079]: time="2017-06-26T23:21:04+02:00" level=error msg="ERROR: diskstats collector failed after 0.002437s: couldn't get diskstats: open /proc/diskstats: no such file or directory" source="node_exporter.go:95"), but this did not cause any trouble: there was no crash and all other metrics were recorded.
If I understand this correctly, SIGPIPE can occur when journald closes node_exporter's stdout. That would fit with node_exporter logging constantly on this machine, which may have increased the likelihood.
I am not sure how to make this more robust. Should I configure Restart=always or RestartForceExitStatus=SIGPIPE in systemd to force a restart, or should node_exporter handle SIGPIPE differently?
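For concreteness, here is a rough sketch of the two systemd options as a drop-in override. As far as I can tell from the systemd.service man page, Restart=on-failure counts termination by SIGPIPE as a clean exit, which would explain why the service stayed dead. The drop-in path in the comment is only my assumption about where such a file would go; only one of the two settings should be needed.
-- 8< --
# /etc/systemd/system/node_exporter.service.d/override.conf (assumed path)
[Service]
# Option 1: restart no matter how the process ended
# (an explicit "systemctl stop" still keeps it stopped).
Restart=always

# Option 2: keep Restart=on-failure from the main unit, but force a
# restart when the main process is killed by SIGPIPE.
#RestartForceExitStatus=SIGPIPE
-- >8 --
Either variant would need a "systemctl daemon-reload" followed by a restart of the service to take effect. Both are untested sketches.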
Thanks!
Henrik