We have a ROCKS 5.0 installation (gmond 3.0.7) on x64 where gmond is
dying periodically: sometimes it stays up for a week, sometimes only a
few hours. We haven't seen any pattern with time of day or cluster
usage that would explain the failures.
I started gmond by hand with the debug option turned on and collected
the output. The daemon stayed up for about a week, then died with a
single final message: "ioctl error: No such device".
Here is the final snippet from the debug output:
Any idea what could be causing this? Google searches didn't turn up
much help. Please let me know if there's any other information that I
can post that would be helpful, or if you think this would be more
appropriate for the ganglia list.
Thanks,
Scott
--
Scott Woods
West Arete Computing
http://westarete.com
sc...@westarete.com
I don't have an idea of what could be causing the problem, but you
could run 'gmond' under 'strace' -- that way you'll know which file
(or device) the ioctl is failing on.
- gb
Thanks Greg. I'll see if I can give that a shot. I think I'd better
set up some pretty frequent log rotation to avoid filling up the disk.
I can only imagine how much data one week of strace output will
generate...
By the way, I forgot to mention that this is happening only on the
master node of an 18-node cluster, with only 4 nodes currently
powered on.
-Scott
Here is the output from strace:
And here are the network interfaces ("ifconfig -a") on that machine:
I downloaded the ganglia source (we're running ganglia 3.0.7), and I
suspect the crash is occurring in the get_ifi_info() function in
libmetrics/get_ifi_info.c (search for SIOCGIFFLAGS).
Is anyone here familiar with the ganglia source? Any idea whether I'm
on the right track diagnosing this?
Thanks very much,
Scott
(Sorry, I seem to have lost the original email thread of this
discussion for quoting. This email is a continuation of this thread: )
> OK, I've successfully run strace against our instance of gmond that
> seems to die periodically. It looks like it's dying while querying
> one of the network interfaces, most likely one of the interfaces
> created for VMware. We have two VMware virtual Linux
> machines (CentOS) running on that node for test environments. Sorry
> I didn't mention that originally, it didn't occur to me that it
> would impact gmond (I know, how many times has *that* been said
> about VMware...)
I should emphasize that gmond is running on the native OS on the
master node, *not* in the VMware instances. The VMware instances are
just stock CentOS virtual machines that happen to be hosted on the
master node as well. It appears that gmond is tripping over the
network interfaces that VMware creates for interfacing with the
virtual machines.
Thanks,
Scott