node exporter process in defunct state, unable to restart

Lakshman Savadamuthu

Jul 21, 2020, 2:58:33 PM

to Prometheus Users

node_exporter.service: main process exited, code=exited, status=1/FAILURE

I tried killing the process and stopping and starting the node_exporter service, but nothing seems to work. I would really appreciate any help here:

[root@mesosagent13 ~]# systemctl status node_exporter
● node_exporter.service - Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: deactivating (stop-sigterm) (Result: exit-code) since Tue 2020-07-21 11:51:21 PDT; 16s ago
  Process: 35895 ExecStart=/usr/local/bin/node_exporter --collector.filesystem --collector.netdev --collector.cpu --collector.diskstats --collector.mdadm --collector.loadavg --collector.time --collector.logind --collector.textfile.directory=/var/lib/node_exporter/textfile_collector --collector.systemd (code=exited, status=1/FAILURE)
 Main PID: 35895 (code=exited, status=1/FAILURE)
    Tasks: 2
   Memory: 118.2M
   CGroup: /system.slice/node_exporter.service

Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg="Build context (go=go1.11.5, user=rmcnamara@ln-r....go:157"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg="Enabled collectors:" source="node_exporter.go:97"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg=" - arp" source="node_exporter.go:104"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg=" - bcache" source="node_exporter.go:104"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg=" - bonding" source="node_exporter.go:104"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg=" - conntrack" source="node_exporter.go:104"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg=" - cpu" source="node_exporter.go:104"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg=" - cpufreq" source="node_exporter.go:104"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg=" - diskstats" source="node_exporter.go:104"
Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com systemd[1]: node_exporter.service: main process exited, code=exited, status=1/FAILURE
Hint: Some lines were ellipsized, use -l to show in full.
[root@mesosagent13 ~]#

Christian Hoffmann

Jul 21, 2020, 3:10:18 PM

to Lakshman Savadamuthu, Prometheus Users

Hi,

On 7/21/20 8:58 PM, Lakshman Savadamuthu wrote:
[...]

> Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com
> node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg="
> - diskstats" source="node_exporter.go:104"
>
> Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com
> systemd[1]: *node_exporter.service: main process exited, code=exited,
> status=1/FAILURE*
Looks like node_exporter is exiting directly after starting, right?

Have you tried running node_exporter without systemd? You can also try
setting --log.level=debug so that we can have some more hints about the
possible issue.

I suspect a failure in an early startup stage (parameter parsing, path
validation), but I would have expected an explicit log message about this.

If this issue is also reproducible when starting without systemd and the
debug log level does not lead to anything, I would try running it with
strace.

If this is a systemd-only issue, try verifying users/permissions and
maybe share your unit file.

What version of node_exporter is this?

Kind regards,
Christian
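
The foreground test suggested above could look roughly like the following sketch. The binary path and the prometheus user come from this thread; the function names run_debug and run_strace and the trace file /tmp/node_exporter.strace are made up for illustration, and sudo and strace are assumed to be installed.

```shell
#!/bin/sh
# Binary path and service user as they appear in this thread; override if needed.
NODE_EXPORTER="${NODE_EXPORTER:-/usr/local/bin/node_exporter}"
SERVICE_USER="${SERVICE_USER:-prometheus}"

# Fail early with a clear message if the binary is missing or not executable.
check_binary() {
    if [ ! -x "$NODE_EXPORTER" ]; then
        echo "not executable: $NODE_EXPORTER" >&2
        return 127
    fi
}

# run_debug [flags...]: start node_exporter in the foreground (no systemd) as
# the service user with debug logging, then report its exit status.
run_debug() {
    check_binary || return 127
    sudo -u "$SERVICE_USER" "$NODE_EXPORTER" --log.level=debug "$@"
    status=$?
    echo "node_exporter exited with status $status" >&2
    return "$status"
}

# run_strace [flags...]: the same start under strace, following threads (-f)
# and writing the trace to a file (-o). The output path is an assumption.
run_strace() {
    check_binary || return 127
    sudo -u "$SERVICE_USER" strace -f -o /tmp/node_exporter.strace \
        "$NODE_EXPORTER" --log.level=debug "$@"
}
```

The collector flags from the unit's ExecStart line can be passed through as arguments, for example run_debug --collector.systemd --collector.logind, to see whether a particular collector triggers the early exit.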

Lakshman Savadamuthu

Jul 21, 2020, 3:34:08 PM

to Prometheus Users

Thanks for the reply Christian.

It looks like node_exporter is in a defunct state; I can't even stop the process now.

Here is the version:

[root@mesosagent13 ~]# /usr/local/bin/node_exporter --version
node_exporter, version 0.17.0 (branch: master, revision: 36e3b2a923e551830b583ecd43c8f9a9726576cf)

[root@mesosagent13 ~]# ps -aef | grep node_exporter
root 8600 61971 0 12:31 pts/0 00:00:00 grep --color=auto node_exporter
prometh+ 53547 1 20 Jun22 ? 6-02:57:16 [node_exporter] <defunct>
[root@mesosagent13 ~]#

I also tried killing the process with pkill -f, but that didn't help either.

Please let me know if there is anything else I can try.

Thanks

Christian Hoffmann

Jul 21, 2020, 3:48:59 PM

to Lakshman Savadamuthu, Prometheus Users

Hi,

On 7/21/20 9:34 PM, Lakshman Savadamuthu wrote:
> Thanks for the reply Christian.
> It looks like node_exporter is in a defunct state; I can't even stop the
> process now.
>
> Here is the version:
>
> [root@mesosagent13 ~]# /usr/local/bin/node_exporter --version
>
> node_exporter, version 0.17.0 (branch: master, revision:
> 36e3b2a923e551830b583ecd43c8f9a9726576cf)

Meanwhile, the latest version is 1.0.1, so updating might be worth a try
(although I don't know of any fixes specific to your issue).

> [root@mesosagent13 ~]# ps -aef | grep node_exporter
>
> root 8600 61971 0 12:31 pts/0 00:00:00 grep --color=auto node_exporter
>
> prometh+ 53547 1 20 Jun22 ? 6-02:57:16 [node_exporter] <defunct>
>
> [root@mesosagent13 ~]#
>
> I also tried killing the process with pkill -f, but that didn't help either.

Hrm, this usually means that the process which started node_exporter has
not yet reaped it after it exited. Was it started via systemd?
Can you share the unit file?

Or is this from a manual start? Could it be that you had backgrounded
the process using "&" or using Ctrl+Z? If so, try foregrounding it (fg)
so that the shell can properly handle the exit.

You can check what this process was last doing in the kernel by running
cat /proc/53547/stack

But I suspect that it will not lead to anything useful.

I think this may just be a dead process table entry. If nothing helps,
you could reboot. In any case, this shouldn't prevent you from running
further tests (e.g. it should not block the listening port or anything).

Kind regards,
Christian
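
To confirm that this is only a dead process table entry, and to see which parent is expected to clean it up, the state and parent PID can be read from /proc. A small sketch, assuming Linux with /proc mounted; proc_info is a hypothetical helper name, not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# proc_info PID: print "<state> <ppid>" for the given PID, taken from the
# State: and PPid: fields of /proc/PID/status.
proc_info() {
    awk '/^State:/ {state = $2}
         /^PPid:/  {ppid = $2}
         END       {print state, ppid}' "/proc/$1/status"
}

# Example on the current shell; for the process in this thread the call
# would be: proc_info 53547
proc_info "$$"
```

A state of Z means the process has already exited and only its table entry remains; the second field is the parent that still has to collect the exit status. Based on the ps output above (defunct, parent PID 1), the entry for 53547 would be expected to show Z with parent 1, which is systemd.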

Lakshman Savadamuthu

Jul 21, 2020, 4:02:00 PM

to Christian Hoffmann, Prometheus Users

Just FYI, there are a few other hosts in this cluster where node_exporter is running fine without any issues.

We started the process using systemctl; here is the service file:

# cat /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter --collector.filesystem --collector.netdev --collector.cpu --collector.diskstats --collector.mdadm --collector.loadavg --collector.time --collector.uname --collector.logind --collector.textfile.directory=/var/lib/node_exporter/textfile_collector --collector.systemd

[Install]
WantedBy=default.target

[

Also here is the stack trace:

[root@mesosagent13 ~]# cat /proc/53547/stack
[<ffffffffb8c9e04b>] do_exit+0x6bb/0xa40
[<ffffffffb8c9e44f>] do_group_exit+0x3f/0xa0
[<ffffffffb8caf24e>] get_signal_to_deliver+0x1ce/0x5e0
[<ffffffffb8c2b527>] do_signal+0x57/0x6f0
[<ffffffffb8c2bc32>] do_notify_resume+0x72/0xc0
[<ffffffffb9375124>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
[root@mesosagent13 ~]#

Christian Hoffmann

Jul 21, 2020, 5:03:14 PM

to Lakshman Savadamuthu, Prometheus Users

Hi,

On 7/21/20 10:01 PM, Lakshman Savadamuthu wrote:
> Just FYI, there are a few other hosts in this cluster where node_exporter
> is running fine without any issues.
> We started the process using systemctl; here is the service
> file:
>
> # cat /etc/systemd/system/node_exporter.service
>
> [Unit]
>
> Description=Node Exporter
>
>
> [Service]
>
> User=prometheus
>
> ExecStart=/usr/local/bin/node_exporter --collector.filesystem
> --collector.netdev --collector.cpu --collector.diskstats
> --collector.mdadm --collector.loadavg --collector.time --collector.uname
> --collector.logind
> --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
> --collector.systemd
>
>
> [Install]
>
> WantedBy=default.target
>
> [

^^^ This looks truncated somehow?

> Also here is the stack trace:
>
> [root@mesosagent13 ~]# cat /proc/53547/stack
>
> [<ffffffffb8c9e04b>] do_exit+0x6bb/0xa40
>
> [<ffffffffb8c9e44f>] do_group_exit+0x3f/0xa0
>
> [<ffffffffb8caf24e>] get_signal_to_deliver+0x1ce/0x5e0
>
> [<ffffffffb8c2b527>] do_signal+0x57/0x6f0
>
> [<ffffffffb8c2bc32>] do_notify_resume+0x72/0xc0
>
> [<ffffffffb9375124>] int_signal+0x12/0x17
>
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [root@mesosagent13 ~]#

This sounds like a classic zombie process. That means the parent (here
systemd, since the parent PID is 1) is expected to clean it up.
I am not sure how this can happen with systemd. Maybe try re-executing it
(systemctl daemon-reexec).

Besides that, I suggest continuing with the other tests, such as running
node_exporter without systemd and with the debug log level. This
should be possible despite the zombie process.

Kind regards,
Christian
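
Why kill and pkill had no effect on the defunct entry can be shown with a throwaway zombie. The following is only a demonstration sketch, assuming Linux with /proc mounted; it does not touch node_exporter or systemd, and the variable and function names are made up for illustration.

```shell
#!/bin/sh
# Demonstration only: create a short-lived zombie and show that SIGKILL
# does not remove it.
pidfile=$(mktemp)

# The inner shell starts a child that exits at once, records the child's PID,
# then replaces itself with sleep, a parent that never calls wait().
sh -c 'sleep 0 & echo $! > "$1"; exec sleep 5' sh "$pidfile" &
parent=$!
sleep 1

child=$(cat "$pidfile")

# state_of PID: print the one-letter process state from /proc/PID/status.
state_of() { awk '/^State:/ {print $2}' "/proc/$1/status" 2>/dev/null; }

before=$(state_of "$child")
# The signal is accepted, but an exited process has nothing left to kill.
kill -9 "$child" 2>/dev/null || true
after=$(state_of "$child")
echo "before kill: $before, after kill: $after"

# Clean up: end the parent so the zombie is handed to another process to reap.
kill "$parent" 2>/dev/null || true
wait "$parent" 2>/dev/null || true
rm -f "$pidfile"
```

On Linux both states should be reported as Z: the entry stays until its parent collects the exit status or the parent itself goes away. In this thread the parent is systemd, which is why re-executing it or rebooting are the remaining options.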
