mysqld_safe spinning CPU on first mysql stop after install

182 views
Skip to first unread message

Eric Hammond

unread,
Oct 5, 2008, 4:46:57 PM10/5/08
to ec2ubuntu
The following problem has been reported and I am able to reproduce it,
but I haven't figured out a solution yet so I'm asking for help.

1. Start an instance of the latest Ubuntu 8.04 Hardy base install on
http://alestic.com

2. Bring software up to date and install MySQL:

apt-get update && apt-get upgrade -y && apt-get install -y mysql-
server

3. Stop the MySQL server:

/etc/init.d/mysql stop

(OR)

mysqladmin shutdown

=> Though the mysqld process stops, the mysqld_safe wrapper starts
using up all available CPU as seen by the "top" command.

Anybody have ideas on what might be going wrong here or how to track
it down and fix it?

The only workaround we have at this point is to:

killall mysqld_safe

In my tests this problem seems to only happen when mysqld_safe is run
at the point mysql-server installed. If I start mysql again or reboot
so that it restarts, then stopping it works cleanly and mysqld_safe is
not left running, much less taking up all CPU.

This inability to repeat the problem after the first install makes it
difficult to add debugging code to mysqld_safe to figure out where it
is spinning.

--
Eric Hammond
http://www.anvion.com

Terry Jones

unread,
Oct 5, 2008, 6:15:41 PM10/5/08
to ec2u...@googlegroups.com
Hi Eric

>>>>> "Eric" == Eric Hammond <eham...@thinksome.com> writes:
Eric> The following problem has been reported and I am able to reproduce it,
Eric> but I haven't figured out a solution yet so I'm asking for help.

One possible clue is that if you strace -p on the mysqld_safe PID, you'll
see streams of output like this

wait4(-1, 0xbf9c23c0, 0, NULL) = -1 ECHILD (No child processes)
wait4(-1, 0xbf9c23c0, 0, NULL) = -1 ECHILD (No child processes)
wait4(-1, 0xbf9c23c0, 0, NULL) = -1 ECHILD (No child processes)
wait4(-1, 0xbf9c23c0, 0, NULL) = -1 ECHILD (No child processes)
wait4(-1, 0xbf9c23c0, 0, NULL) = -1 ECHILD (No child processes)
wait4(-1, 0xbf9c23c0, 0, NULL) = -1 ECHILD (No child processes)
wait4(-1, 0xbf9c23c0, 0, NULL) = -1 ECHILD (No child processes)

You'll see there's a wait call in the while true loop in mysqld_safe.

If you run ps aguxww | grep -v grep | grep mysql many times, you'll notice
you never see a mysqld (at least I haven't).

Replacing /usr/sbin/mysqld with a shell script that appends a line to a
file in /tmp shows that it is not being run.

These things imply to me that the mysqld process is never being
re-started. That would cause the wait to have nothing to wait for, it would
cause there to be no $pid_file, and that would cause the while true loop to
spin.


But I don't see the cause. If $ledir/$MYSQLD somehow didn't exist by the
time the loop was hit, that would be nice :-) But that's tested (early)
with a test -x. I don't see anything that could cause a cd.

I wonder if $args might be empty on one call and non-empty on another,
causing different invocations (one with eval, one without). That might
explain why it sometimes spins and sometimes launches mysqld.

I don't know where the logger output or the stdout (from echo "STOPPING
server from pid file $pid_file") are going.

Sorry - I can't put much more time into this... :-) I'd definitely be
printing a few of these things in mysqld_safe in a new image if I were
going to dig further.

I also tried gdb -p PID, to attach to the running mysqld_safe, but saw
little (it was stopped in wait4).

Terry

Terry Jones

unread,
Oct 5, 2008, 7:58:06 PM10/5/08
to ec2u...@googlegroups.com
Hi again Eric

My last mail had an thinko... There's a break in the no pid_file check, not
a continue. So that wouldn't cause the loop to spin.

I started a new instance, did an apt-get -d install mysql-server, extracted
the contents of the .deb file (in /var/cache/apt/archives) and replaced
usr/bin/mysqld_safe with a version that writes some stuff to /tmp (code
attached - it's an ugly hack, sorry).

Then I rebuilt the mysql-server .deb, installed it and its deps and had a
running mysld. I stopped this via /etc/init.d/mysql stop and had the
following in my /tmp output file:

# cat /tmp/safeout
Sun Oct 5 23:37:27 UTC 2008: started.
Sun Oct 5 23:37:27 UTC 2008: args are .
Sun Oct 5 23:37:27 UTC 2008: b4 loop: args='--port=3306 --socket=/var/run/mysqld/mysqld.sock '.
Sun Oct 5 23:37:27 UTC 2008: b4 loop: ledir='/usr/sbin'.
Sun Oct 5 23:37:27 UTC 2008: b4 loop: MYSQLD='mysqld'.
Sun Oct 5 23:37:27 UTC 2008: b4 loop: pid_file='/var/run/mysqld/mysqld.pid'.
Sun Oct 5 23:37:27 UTC 2008: b4 loop: NOHUP_NICENESS='nohup'.
Sun Oct 5 23:37:27 UTC 2008: starting via eval.
Sun Oct 5 23:37:29 UTC 2008: in 2nd loop I=1 numof=1.
Sun Oct 5 23:37:29 UTC 2008: hanging.
Sun Oct 5 23:37:29 UTC 2008: starting via eval.

If you have a look at my script you'll see it should be dumping much more
output if indeed it's looping in some obvious way.

So it does look like there's something weird going on with the wait. BTW, I
still see the same CPU behavior of mysqld_safe and the same thing in strace
with my hacked mysqld_safe.

Sorry for the red herring first time round. Now I really have to stop
thinking about this. I don't even use mysql...

Terry

mysqld_safe

Terry Jones

unread,
Oct 5, 2008, 8:40:59 PM10/5/08
to ec2u...@googlegroups.com
OK, final comment on this.

Changing the if stmt in mysqld_safe that starts mysqld to look like this:

if test -z "$args"
then
$NOHUP_NICENESS $ledir/$MYSQLD $defaults --basedir=$MY_BASEDIR_VERSION --datadir=$DATADIR $USER_OPTION --pid-file=$pid_file --skip-external-locking 2>&1 | $ERR_LOGGER -t mysqld &
else
eval "$NOHUP_NICENESS $ledir/$MYSQLD $defaults --basedir=$MY_BASEDIR_VERSION --datadir=$DATADIR $USER_OPTION --pid-file=$pid_file --skip-external-locking $args 2>&1 | $ERR_LOGGER -t mysqld &"
fi
wait $!


results in mysqld shutting down cleanly and mysqld_safe breaks as expected
in the loop, and exits.

if test ! -f $pid_file # This is removed if normal shutdown
then


echo "STOPPING server from pid file $pid_file"

break
fi


So it seems the eval "..... & wait" was causing the problem. Waiting
explicitly on the PID that was just started fixes it.

So I guess someone needs to convince the mysql people to change mysqld_safe
to be a bit less terse.

I had to do all this by building and installing a new .deb, as described in
my last mail.

Terry

Eric Hammond

unread,
Oct 6, 2008, 5:41:07 AM10/6/08
to ec2ubuntu
Terry:

Thanks for your great investigative work.

As far as I can tell the mysqld_safe code seems to be correct, though
you found a workaround for whatever was misbehaving in the "wait".
Unfortunately, I doubt we'll be able to get this into the Ubuntu
packages.

At this point, I wonder if there is some sort of interaction between
the Ubuntu libc? and Amazon's Xen kernel. This is probably going to
be difficult to track down without Amazon support.

I tried a number of different AMIs going back in time and of different
Ubuntu releases. I built a number of new ones trying different
combinations with and without libc6-xen and tls. All of them
experienced the same problem except the Dapper AMI.

Using "killall mysqld_safe" after the first "mysql stop" is not a
difficult workaround, but it worries me that there is something else
going on here which might affect other processes.

--
Eric Hammond
http://www.anvilon.com

Martin Kleppmann

unread,
Oct 6, 2008, 6:04:36 AM10/6/08
to ec2u...@googlegroups.com
Hi all,

A few days ago I had an issue with the 32-bit ec2onrails AMI (ami-
c9bc58a0), which is based on ec2ubuntu. That AMI is not very heavily
customised, it's mainly a few extra packages pre-installed, certain
users and permissions set up etc. Although my problem was not in
mysqld_safe, the symptoms sound similar:

Running "apt-get install shorewall" on a fresh ami-c9bc58a0 instance
(post security updates) gets stuck at the configure stage; it looks
like the CPU is spinning in the /usr/sbin/update-rc.d script which is
supposed to create the symlinks in /etc/rc?.d.

I didn't have time to investigate it in detail, and I worked around it
by doing a fresh manual install from the ec2ubuntu image. It looked
like the bug should be reproducable though.

The strange thing is that I had the shorewall installation as part of
an automated build script, and it was working fine until recently.
This may just be a hint which supports Eric's idea that there is
something wrong on kernel level...

Martin

On 6 Oct 2008, at 10:41, Eric Hammond wrote:
> Terry:
>
> Thanks for your great investigative work.
>
> As far as I can tell the mysqld_safe code seems to be correct, though
> you found a workaround for whatever was misbehaving in the "wait".
> Unfortunately, I doubt we'll be able to get this into the Ubuntu
> packages.
>
> At this point, I wonder if there is some sort of interaction between
> the Ubuntu libc? and Amazon's Xen kernel. This is probably going to
> be difficult to track down without Amazon support.
>
> I tried a number of different AMIs going back in time and of different
> Ubuntu releases. I built a number of new ones trying different
> combinations with and without libc6-xen and tls. All of them
> experienced the same problem except the Dapper AMI.
>
> Using "killall mysqld_safe" after the first "mysql stop" is not a
> difficult workaround, but it worries me that there is something else
> going on here which might affect other processes.
>
> --
> Eric Hammond
> http://www.anvilon.com

--
Martin Kleppmann
http://www.eptcomputing.com/

Eric Hammond

unread,
Oct 6, 2008, 7:59:55 AM10/6/08
to ec2ubuntu
Martin:

Thanks for the report. Reproducible is good as it may help folks
track down the problem and it will be good to test once we think the
problem is corrected.

--
Eric Hammond
http://www.anvilon.com

Eric Hammond

unread,
Oct 6, 2008, 8:16:07 AM10/6/08
to ec2ubuntu

I ran 25+ instances with different configurations. The test consisted
of:

apt-get update && apt-get upgrade -y && apt-get install -y mysql-
server
/etc/init.d/mysql stop
top

If mysqld_safe was spinning CPU, then it failed, otherwise it passed.

I found that a Hardy AMI built fresh on a 2.6.16 kernel worked
correctly, but failed when run on 2.6.18 or 2.6.21.

The Dapper AMI i-b1a805d8 from http://alestic.com passes on the
default 2.6.21.

The Scalr base AMI ami-51f21638 passed. By default it runs on 2.6.16,
but I also tried it on 2.6.18 and 2.6.21 and it still passed.

ec2-run-instances \
--key KEYPAIR \
--kernel aki-a71cf9ce \
--ramdisk ari-a51cf9cc \
ami-51f21638

It's not built for 2.6.21, so you'll find it takes forever to start up
and nash-hotplug needs to be killed first thing when you log in:

killall nash-hotplug

However, I think there may be hope in trying to find what the key
differences are between Scalar ami-51f21638 and Alestic.com
ami-1cd73375.

If anybody looks at these and comes up with suspicious differences, I
can build new AMIs and test.

I already tried removing the apparmor and aparmor-utils packages
(which won't work anyway because Amazon's kernels are too old).

mariusmarais

unread,
Nov 5, 2008, 4:34:25 AM11/5/08
to ec2ubuntu
Hi all

I'm currently running into this problem in non-EC2 Ubuntu. (Hardy
Desktop, latest packages as of today.)

mysqld_safe spins the CPU if you shut MySQL down after installing it.
All other symptoms exactly as described in earlier posts, so this is
not just a EC2Ubuntu problem.

To reproduce:
apt-get remove --purge mysql-server*
apt-get install -y mysql-server
/etc/init.d/mysql stop
top

Kill mysqld_safe with killall mysqld_safe

See here: https://bugs.launchpad.net/ubuntu/+source/mysql-dfsg-5.0/+bug/105457

Marius

On Oct 6, 2:16 pm, Eric Hammond <ehamm...@thinksome.com> wrote:
> I ran 25+ instances with different configurations.  The test consisted
> of:
>
>     apt-get update && apt-get upgrade -y && apt-get install -y mysql-
> server
>     /etc/init.d/mysql stop
>     top
>
> Ifmysqld_safewas spinning CPU, then it failed, otherwise it passed.
>
> I found that a Hardy AMI built fresh on a 2.6.16 kernel worked
> correctly, but failed when run on 2.6.18 or 2.6.21.
>
> The Dapper AMI i-b1a805d8 fromhttp://alestic.compasses on the
Reply all
Reply to author
Forward
0 new messages