We have a ROCKS 5.0 installation (gmond 3.0.7) on x64 where gmond is
dying periodically: sometimes it stays up for a week, sometimes only a
few hours. We haven't seen any pattern with time of day or cluster
usage that would explain the failures.
I started gmond by hand with the debug option turned on and collected
the output. The daemon stayed up for about a week, then died with a
single final message: "ioctl error: No such device".
Here is the final snippet from the debug output:
Any idea what could be causing this? Google searches didn't turn up
much help. Please let me know if there's any other information that I
can post that would be helpful, or if you think this would be more
appropriate for the ganglia list.
Thanks,
Scott
--
Scott Woods
West Arete Computing
http://westarete.com
sc...@westarete.com
I don't have an idea of what could be causing the problem, but you
could run 'gmond' under 'strace' -- that way you'll know which file
(or device) the ioctl is failing on.
- gb
Thanks Greg. I'll see if I can give that a shot. I think I'd better
set up some pretty frequent log rotation to avoid filling up the disk.
I can only imagine how much data one week of strace output will
generate...
By the way, I forgot to mention that this is happening only on the
master node of an 18-node cluster, with only 4 nodes currently
powered on.
-Scott
Here is the output from strace:
And here are the network interfaces ("ifconfig -a") on that machine:
I downloaded the ganglia source (we're running ganglia 3.0.7), and I
suspect the crash is occurring in the get_ifi_info() function in
libmetrics/get_ifi_info.c (search for SIOCGIFFLAGS).
Is anyone here familiar with the ganglia source? Any idea whether I'm
on the right track diagnosing this?
Thanks very much,
Scott
(Sorry, I seem to have lost the original email thread of this
discussion for quoting. This email is a continuation of this thread: )
> OK, I've successfully run strace against our instance of gmond that
> seems to die periodically. It looks like it's dying while querying
> one of the network interfaces, most likely one of the interfaces
> created for VMware. We have two VMware virtual Linux
> machines (CentOS) running on that node for test environments. Sorry
> I didn't mention that originally, it didn't occur to me that it
> would impact gmond (I know, how many times has *that* been said
> about VMware...)
I should emphasize that gmond is running on the native OS on the
master node, *not* in the VMware instances. The VMware instances are
just stock CentOS virtual machines that happen to be hosted on the
master node as well. It appears that gmond is tripping over the
network interfaces that VMware creates for interfacing with the
virtual machines.
Thanks,
Scott