Big problems with 7.1 locking up :-(

Pete French

Jan 8, 2009, 8:59:42 PM

I have a number of HP 1U servers, all of which were running 7.0
perfectly happily. I have been testing 7.1 in its various incarnations
for the last couple of months on our test server and it has performed
perfectly.

So the last two days I have been round upgrading all our servers, knowing
that I had run the system stably on identical hardware for some time.

Since then I have started seeing machines lock up. This always happens under
heavy disc load. When I bring the machine back up then sometimes it fails
to fsck due to a partially truncated inode. The lockups appear to
be disc related - on my MySQL master machine it will come back up with
files somewhat shorter than those which have already been transmitted to
the slave (i.e. some data was in memory, and claimed to have been written
to the drive, but never made it onto the disc).

The only time I have seen anything useful on the screen was during one lockup
where I got a message about a spin lock being held too long and some
comment in parentheses about it being a turnstile lock.

Help! :-(

I am now downgrading all the machines to 7.0 as fast as I can - though the
machine I am trying to compile it on has locked up once during the compile
so I haven't got anywhere so far.

The machines are HP ProLiant DL360 G5s - they have an embedded P400i
RAID controller with a pair of mirrored drives connected. Each one has
both ethernets connected, bundled using lagg and LACP.

Advice ?

-pete.
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stabl...@freebsd.org"

Guy Helmer

Jan 9, 2009, 9:50:30 AM

Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0
> perfectly happily. I have been testing 7.1 in its various incarnations
> for the last couple of months on our test server and it has performed
> perfectly.
>
> So the last two days I have been round upgrading all our servers, knowing
> that I had run the system stably on identical hardware for some time.
>
> Since then I have started seeing machines lock up. This always happens under
> heavy disc load. When I bring the machine back up then sometimes it fails
> to fsck due to a partially truncated inode. The lockups appear to
> be disc related - on my MySQL master machine it will come back up with
> files somewhat shorter than those which have already been transmitted to
> the slave (i.e. some data was in memory, and claimed to have been written
> to the drive, but never made it onto the disc).
>
> The only time I have seen anything useful on the screen was during one lockup
> where I got a message about a spin lock being held too long and some
> comment in parentheses about it being a turnstile lock.
>
> Help! :-(
>
> I am now downgrading all the machines to 7.0 as fast as I can - though the
> machine I am trying to compile it on has locked up once during the compile
> so I haven't got anywhere so far.
>
> The machines are HP ProLiant DL360 G5s - they have an embedded P400i
> RAID controller with a pair of mirrored drives connected. Each one has
> both ethernets connected, bundled using lagg and LACP.
>

I can't tell whether my situation is related, but I am seeing lockups on
SMP Supermicro servers with both older (NetBurst-ish) and current Xeon
CPUs. I have been dropping into the kernel debugger and getting lock
information and process backtraces, but so far nothing has been
conclusively identified. I think the issue I'm seeing was introduced
sometime between October 2 and November 24 in the RELENG_7 branch, and I
suppose the next step is to do a binary search for the offending change.
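
The binary search mentioned here can be sketched roughly as follows. This is only an illustration: the day indices count daily RELENG_7 snapshots from October 2 (index 0, assumed good) to November 24 (index 53, assumed bad), and `is_bad` is a hypothetical stand-in for "build and boot a kernel from that day's sources and wait for a hang". The answer it returns here is made up purely so that the loop can run.

```sh
#!/bin/sh
# Bisect over daily source snapshots between a known-good and a known-bad date.
# Index 0 = Oct 2 (good), index 53 = Nov 24 (bad).

is_bad() {
    # Hypothetical result, hard-coded so the sketch is runnable.
    # In practice this is a manual build/boot/soak test of snapshot "$1".
    [ "$1" -ge 30 ]
}

good=0
bad=53
steps=0
while [ $((bad - good)) -gt 1 ]; do
    mid=$(( (good + bad) / 2 ))
    if is_bad "$mid"; then
        bad=$mid        # the regression is at or before this snapshot
    else
        good=$mid       # the regression came in after this snapshot
    fi
    steps=$((steps + 1))
done
echo "first bad snapshot: day $bad, found after $steps test builds"
```

The point of the sketch is the cost: a window of roughly 53 days needs about six test kernels rather than dozens, though each test may take a long time if the hang is slow to appear.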

Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.

Mike Tancsa

Jan 9, 2009, 10:07:25 AM

At 09:49 AM 1/9/2009, Guy Helmer wrote:

>>RAID controller with a pair of mirrored drives connected. Each one has
>>both ethernets connected, bundled using lagg and LACP.
>>
>>
>I can't tell whether my situation is related, but I am seeing
>lockups on SMP Supermicro servers with both older (NetBurst-ish) and
>current Xeon CPUs. I have been dropping into the kernel debugger
>and getting lock information and process backtraces, but so far
>nothing has been conclusively identified. I think the issue I'm
>seeing was introduced sometime between October 2 and November 24 in
>the RELENG_7 branch, and I suppose the next step is to do a binary
>search for the offending change.

Are you using the same disk controller as Peter? Do both of you run
with quotas on the file system? By lockup, do you mean it doesn't
respond to the network either, or just anything that needs disk IO?

---Mike

Pete French

Jan 9, 2009, 10:16:15 AM

> Are you using the same disk controller as Peter? Do both of you run
> with quotas on the file system? By lockup, do you mean it doesn't
> respond to the network either, or just anything that needs disk IO?

I don't think he can be using the same controller, as mine is an
embedded HP unit. They do make a separate plug-in one though - the P400
SAS controller.

My symptoms are that the thing locks hard and responds to nothing, no
keypresses or anything. I am assuming that the disc is the first thing to
go though, because I see data which was being written to a file and a
process reading from that file to the network. More of the file comes
over the network than makes it physically onto the disc.

The only useful error I ever saw was the message about spin
lock / turnstile locks being held for too long.

-pete.

Guy Helmer

Jan 9, 2009, 10:29:13 AM

Pete French wrote:
>> Are you using the same disk controller as Peter? Do both of you run
>> with quotas on the file system? By lockup, do you mean it doesn't
>> respond to the network either, or just anything that needs disk IO?
>>
>
> I don't think he can be using the same controller, as mine is an
> embedded HP unit. They do make a separate plug-in one though - the P400
> SAS controller.
>
> My symptoms are that the thing locks hard and responds to nothing, no
> keypresses or anything. I am assuming that the disc is the first thing to
> go though, because I see data which was being written to a file and a
> process reading from that file to the network. More of the file comes
> over the network than makes it physically onto the disc.
>
> The only useful error I ever saw was the message about spin
> lock / turnstile locks being held for too long.
>
> -pete.
>

OK, perhaps my issue is different then. My symptoms seem to be a hang
from anything that triggers a fork(), such as entering a command at a
shell prompt or entering a user name at the console's login prompt.
Network activity still works -- all the TCP connections stay up until I
drop into the kernel debugger or power cycle.

Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.


Robert Blayzor

Jan 9, 2009, 2:56:38 PM

On Jan 8, 2009, at 8:58 PM, Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0
> perfectly happily. I have been testing 7.1 in its various
> incarnations
> for the last couple of months on our test server and it has performed
> perfectly.

I noticed a problem with 7.0 on a couple of Dell servers. Not sure if
this is related but when our system "froze" the box was pingable, and
you could switch virtual consoles... however, you could not type
anything on the screen or connect to any sockets. Num-lock would
still work so the box wasn't solidly frozen. This used to happen a
couple of times every week or two. We've since then compiled the
kernel under the BSD scheduler to rule that out, and so far so good.
(our box was a Dell PE1750, 2GB of RAM, amr RAID controller, bge
network driver) The primary application was just ntpd and apache with
mpm_worker & threads.

Since ULE is now default in 7.1 and not in 7.0, perhaps you can try
that?
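
For anyone wanting to try the scheduler swap suggested here, a kernel config along the following lines may be a starting point. It is a sketch only: the file name SCHED4BSD and the temporary directory are illustrative (a real config would normally live under /usr/src/sys/<arch>/conf/), and it assumes the config(5) `nooptions` directive is available on the release in use, which is worth checking before relying on it.

```sh
#!/bin/sh
# Write a small kernel config that starts from GENERIC and swaps the scheduler.
# The directory is a scratch location so the sketch does not touch /usr/src.
confdir=$(mktemp -d /tmp/kernconf.XXXXXX)

cat > "$confdir/SCHED4BSD" <<'EOF'
include GENERIC
ident SCHED4BSD
nooptions SCHED_ULE
options SCHED_4BSD
EOF

# Show the result for review before copying it into the source tree.
cat "$confdir/SCHED4BSD"
```

Only one scheduler option should be present in the final config, which is why the sketch removes SCHED_ULE before adding SCHED_4BSD.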

--
Robert Blayzor, BOFH
INOC, LLC
rbla...@inoc.net
http://www.inoc.net/~rblayzor/

Pete French

Jan 9, 2009, 4:45:40 PM

> Since ULE is now default in 7.1 and not in 7.0, perhaps you can try
> that?

Actually you might be on to something there... one of the main differences
between our test DL360 and the live ones is that the test one has fewer
cores in it, and is under less load. So multiprocessing problems may well
show up on the live boxes where they won't on the test box. I shall try
building a kernel with the BSD scheduler and see what happens there.
Probably not today, as I am loath to cause any more downtime right now.

thanks,

-pete.

Garance A Drosihn

Jan 9, 2009, 10:20:46 PM

At 1:58 AM +0000 1/9/09, Pete French wrote:
>I have a number of HP 1U servers, all of which were running 7.0
>perfectly happily. I have been testing 7.1 in its various incarnations
>for the last couple of months on our test server and it has performed
>perfectly.
>
>So the last two days I have been round upgrading all our servers, knowing
>that I had run the system stably on identical hardware for some time.
>
>Since then I have started seeing machines lock up. This always happens
>under heavy disc load. When I bring the machine back up then sometimes
>it fails to fsck due to a partially truncated inode. The lockups appear
>to be disc related [...]

One of my friends is also having trouble with lockups on two machines
he had upgraded to 7.1. Also seems to be related to heavy disk I/O,
although I'm not sure the symptoms are the same as what you report.
Both machines had been running 7.0-release without trouble. On at
least one of the systems, he's also working with (what I consider)
very large file systems (over 2 TB). Both machines are using a 3ware
controller with its RAID.

I realize that isn't much to go on, but it suggests that there is
some problem wider than just your (Pete's) usage. I think his
situation is such that lockups like this are simply not acceptable,
and the last I heard he was reverting back to 7.0-release.

--
Garance Alistair Drosehn = g...@gilead.netel.rpi.edu
Senior Systems Programmer or g...@freebsd.org
Rensselaer Polytechnic Institute or dro...@rpi.edu

Garance A Drosihn

Jan 9, 2009, 10:27:53 PM

At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:

>On Jan 8, 2009, at 8:58 PM, Pete French wrote:
>>I have a number of HP 1U servers, all of which were running 7.0
>>perfectly happily. I have been testing 7.1 in its various incarnations
>>for the last couple of months on our test server and it has performed
>>perfectly.
>
>

>I noticed a problem with 7.0 on a couple of Dell servers. [...]

>We've since then compiled the kernel under the BSD scheduler to rule
>that out, and so far so good.
>

>Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?

FWIW, the other guy I know who is having this problem had already
switched to using ULE under 7.0-release, and did not have any
problems with it. So *his* problem was probably not related to
SCHED_ULE, unless something has recently changed there.

Turns out he hasn't reverted back to 7.0-release just yet, so he's
going to try SCHED_4BSD and see if that helps his situation.

Pete French

Jan 10, 2009, 10:50:29 AM

> FWIW, the other guy I know who is having this problem had already
> switched to using ULE under 7.0-release, and did not have any
> problems with it. So *his* problem was probably not related to
> SCHED_ULE, unless something has recently changed there.

Well, one of my machines just locked up again, even with SCHED_4BSD
on it, so I am now thinking it is unrelated.

The machine has completely locked - no response to pings, no
response to keypresses, nor to the power button. There is nothing
printed on the console - it is just sitting there with a login prompt :-(

This is really not good - these are extremely common servers after all, and
I am just running bog standard 7.1 with apache and mysql. This is happening
across several different servers, all of which are slight variants on
the DL360, so I don't think it is something peculiar to me.

-pete.

helioc...@gmail.com

Jan 10, 2009, 9:08:33 PM

I noticed a similar problem testing 7.1-RC1. It seemed to be a deep
deadlock, as it was triggered by lighttpd doing kern_sendfile, and
never returning. The side effects (being unable to create processes,
etc.) are similar.

My kernconf is below. Try building the kernel, and send an email
containing the backtrace from any process that has blocked (in my
case, lighttpd attempting to sendfile a large amount of data to php
fastcgi triggered it, but that's a guess on my part). Note that this
includes witness and invariants, so performance will be hit. Also,
enable watchdogd, and add -e 'ls -al /etc' to its flags. It should
drop you to a debugger with a backtrace within a few seconds of the
lock being triggered, and it should output a backtrace and any
invariant/witness lock warnings. Obviously if you don't have a serial
or local console, don't do this.

include GENERIC
ident DEBUG
options KDB
options DDB
options SW_WATCHDOG
options DEBUG_VFS_LOCKS
options INVARIANTS
options WITNESS
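
The watchdogd part of the advice above would look roughly like this in /etc/rc.conf. This is a sketch, assuming the stock rc.d script reads the usual `watchdogd_enable` and `watchdogd_flags` variables; the check command is the one quoted in the message, and the kernel needs `options SW_WATCHDOG` (as in the config above) for it to have any effect.

```sh
# /etc/rc.conf fragment (rc.conf is plain sh variable assignments)

# Start watchdogd at boot.
watchdogd_enable="YES"

# -e runs the given command as the periodic health check; if the command
# cannot complete (for example because disc I/O is wedged), the watchdog
# is no longer patted and the software watchdog eventually fires.
watchdogd_flags="-e 'ls -al /etc'"
```

The check command is chosen to touch the filesystem, so a hang in the disc path should stop it from completing.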

Pete French

Jan 11, 2009, 7:46:38 AM

> I noticed a similar problem testing 7.1-RC1. It seemed to be a deep
> deadlock, as it was triggered by lighttpd doing kern_sendfile, and
> never returning. The side effects (being unable to create processes,
> etc.) are similar.

Interesting - did you get any responses from anyone else regarding
this? My last box which locked up was essentially idle, so I am very
surprised by all of this - also none of the heavily loaded machines
(i.e. the actual webservers) have locked up.

I am also surprised that this isn't more widely reported, as
the hardware is very common. The only oddity with my compile
is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but
I will remove it anyway, just so I am actually building a completely
vanilla amd64. That way I should have what everyone else has, and since
I don't see anyone else saying they have issues then maybe mine will
go away too (fingers crossed)

> My kernconf is below, try building the kernel, and send an email
> containing the backtrace from any process that has blocked (in my

OK, will do. I can try this on the one non-essential box which
locked up yesterday. I don't know how long it will be before it
locks up again, but will see if I can do some things to provoke it.
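
One crude way of trying to provoke a disc-related hang is to generate sustained write load, along the lines of the sketch below. Nothing here is known to reproduce the lockups in this thread. The file count, block count and scratch directory are illustrative and deliberately tiny, and would need raising considerably for a real soak test.

```sh
#!/bin/sh
# Crude disc-load generator: write a batch of files, flush, report, clean up.
FILES=4             # number of files to write per run (illustrative)
BLOCKS=8            # 1 MiB blocks per file (illustrative)
workdir=$(mktemp -d /tmp/ioload.XXXXXX)

i=0
while [ "$i" -lt "$FILES" ]; do
    # Block size is given in bytes so the line works with both BSD and GNU dd.
    dd if=/dev/zero of="$workdir/load.$i" bs=1048576 count="$BLOCKS" 2>/dev/null
    i=$((i + 1))
done

sync                # ask the kernel to flush dirty buffers to disc

written=$(ls "$workdir" | wc -l | tr -d ' ')
echo "wrote $written files of $BLOCKS MiB each under $workdir"
rm -rf "$workdir"
```

Running several copies in parallel on the affected filesystem, with much larger sizes, would be closer to the "heavy disc load" described earlier in the thread.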

Pete French

Jan 11, 2009, 11:28:37 AM

> My kernconf is below, try building the kernel, and send an email
> containing the backtrace from any process that has blocked (in my

Well, I haven't managed to get a backtrace, but immediately upon
booting the system halts with the following:

http://www.twisted.org.uk/~pete/71_lor1.jpg

Interestingly, if I try and boot into safe mode then it will not
even get that far:

http://www.twisted.org.uk/~pete/71_safe1.jpg

Am going to try and backtrace that now to see what I can get. Unfortunately
I can only provide screen captures rather than actual text output from
this due to having to go via a Mac running RDP through an ssh tunnel
to a Windows box and then using IE to go to the iLO :-) Convoluted,
but it works...

Dylan Cochran

Jan 11, 2009, 2:02:33 PM

On Sun, Jan 11, 2009 at 11:27 AM, Pete French
<petef...@ticketswitch.com> wrote:
>> My kernconf is below, try building the kernel, and send an email
>> containing the backtrace from any process that has blocked (in my
>
> Well, I haven't managed to get a backtrace, but immediately upon
> booting the system halts with the following:
>
> http://www.twisted.org.uk/~pete/71_lor1.jpg

Not Found

Pete French

Jan 11, 2009, 2:11:52 PM

> Not Found

Sorry, see the subsequent email; there are more links there to working PNGs.

-pete.

Garrett Cooper

Jan 11, 2009, 2:24:09 PM

On Sun, Jan 11, 2009 at 4:45 AM, Pete French
<petef...@ticketswitch.com> wrote:
>> I noticed a similar problem testing 7.1-RC1. It seemed to be a deep
>> deadlock, as it was triggered by lighttpd doing kern_sendfile, and
>> never returning. The side effects (being unable to create processes,
>> etc.) are similar.
>
> Interesting - did you get any responses from anyone else regarding
> this? My last box which locked up was essentially idle, so I am very
> surprised by all of this - also none of the heavily loaded machines
> (i.e. the actual webservers) have locked up.
>
> I am also surprised that this isn't more widely reported, as
> the hardware is very common. The only oddity with my compile
> is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but
> I will remove it anyway, just so I am actually building a completely
> vanilla amd64. That way I should have what everyone else has, and since
> I don't see anyone else saying they have issues then maybe mine will
> go away too (fingers crossed)
>
>> My kernconf is below, try building the kernel, and send an email
>> containing the backtrace from any process that has blocked (in my
>
> OK, will do. I can try this on the one non-essential box which
> locked up yesterday. I don't know how long it will be before it
> locks up again, but will see if I can do some things to provoke it.
>
> -pete.

Intel suggests nocona for x86_64 platforms and prescott for x86
(i386) based platforms on the GCC 4.2 line, because they best matched the
cache size and feature set of the Core2 processors.

I don't think that core2 support was fully completed in GCC 4.2 (in
fact I believe it was just started), and I don't think that our
binutils supports it properly.

Some thoughts,
-Garrett

Claus Guttesen

Jan 12, 2009, 6:29:13 AM

>> I am also surprised that this isn't more widely reported, as
>> the hardware is very common. The only oddity with my compile
>> is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but
>> I will remove it anyway, just so I am actually building a completely
>> vanilla amd64. That way I should have what everyone else has, and since
>> I don't see anyone else saying they have issues then maybe mine will
>> go away too (fingers crossed)
>>

> Intel suggests nocona for x86_64 platforms and prescott for x86
> (i386) based platforms on the 4.2 line, because they best matched the
> cache size and featureset of the Core2 processors.
>
> I don't think that core2 support was fully completed in 4.2 (in
> fact I believe it was just started), and I don't think that our
> binutils supports it properly.
>
> Some thoughts,
> -Garrett

I upgraded a test webserver to 7.1 when it was released. After a
few days I upgraded a production webserver to 7.1 on Jan. 8th and it
has been running without any problems. The webserver is not heavily
loaded (load at 2-3 on average). I have run a buildworld -j 8 and it
runs fine.

If the reported lockup is due to i/o, a buildworld will not be able to
reproduce it.

It has performed a buildworld without problems and I'll be doing some
buildworlds throughout the day.

This is on an HP c-class blade with 8 GB RAM, 2 x quad-core and the
built-in P200 controller with 64 MB RAM.

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Claus Guttesen

Jan 12, 2009, 6:38:59 AM

> I upgraded a test webserver to 7.1 when it was released. After a
> few days I upgraded a production webserver to 7.1 on Jan. 8th and it
> has been running without any problems. The webserver is not heavily
> loaded (load at 2-3 on average). I have run a buildworld -j 8 and it
> runs fine.
>
> If the reported lockup is due to i/o, a buildworld will not be able to
> reproduce it.
>
> It has performed a buildworld without problems and I'll be doing some
> buildworlds throughout the day.
>
> This is on an HP c-class blade with 8 GB RAM, 2 x quad-core and the
> built-in P200 controller with 64 MB RAM.

Forgot to add that CPUTYPE=nocona in /etc/make.conf.

Claus Guttesen

Jan 12, 2009, 9:13:49 AM

>> It has performed a buildworld without problems and I'll be doing some
>> buildworlds throughout the day.
>>
>> This is on an HP c-class blade with 8 GB RAM, 2 x quad-core and the
>> built-in P200 controller with 64 MB RAM.

I've performed five buildworlds decrementing -j from 16 to 6 and I
can't lock up the server.

Pete French

Jan 12, 2009, 9:36:13 AM

> I've performed five buildworlds decrementing -j from 16 to 6 and I
> can't lock up the server.

Mine never lock up doing buildworlds either. They only lock up when they are
sitting there more or less idle! The machines which have never locked up
are the webservers, which are fairly heavily loaded. The machine which locks
up the most frequently is a box sitting there doing nothing but DNS, which is
the most lightly loaded of the lot.

I am going to roll back to 7.0 on all of the HP machines now, having
had yet another day of rebooting locked up machines. I will leave one
running 7.1 with the debug options in the kernel to try and get some
useful results out of this. All the machines are now running GENERIC with
no special optimisations, CPU types or anything like that. Absolutely out
of the box vanilla 7.1/amd64 as far as I know :-(

-pete.

Robert Watson

Jan 12, 2009, 9:56:38 AM

On Fri, 9 Jan 2009, Garance A Drosihn wrote:

> At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:
>> On Jan 8, 2009, at 8:58 PM, Pete French wrote:
>>> I have a number of HP 1U servers, all of which were running 7.0 perfectly
>>> happily. I have been testing 7.1 in its various incarnations for the last
>>> couple of months on our test server and it has performed perfectly.
>>
>> I noticed a problem with 7.0 on a couple of Dell servers. [...] We've
>> since then compiled the kernel under the BSD scheduler to rule that out,
>> and so far so good.
>>
>> Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?
>

> FWIW, the other guy I know who is having this problem had already switched
> to using ULE under 7.0-release, and did not have any problems with it. So
> *his* problem was probably not related to SCHED_ULE, unless something has
> recently changed there.
>

> Turns out he hasn't reverted back to 7.0-release just yet, so he's going to
> try SCHED_4BSD and see if that helps his situation.

Scheduler changes always come with some risk of exposing bugs that have
existed in the code for a long time but never really manifested themselves.
ULE is well shaken-out, having been under development for at least five years,
but it is possible that some problems will become visible as a result of the
switch. I would encourage people to stick with ULE, but if you're having a
stability problem then experimenting with the scheduler as a variable that could
be triggering the problem may well be useful to help track down the bug. Most
of the time the bugs will not be in ULE itself, rather, triggered because ULE
will change the ordering or balancing of work in the system, so we should try
to avoid situations where people switch to 4BSD from ULE and stick with it
rather than getting the underlying problem fixed!

Robert N M Watson
Computer Laboratory
University of Cambridge

Robert Watson

Jan 12, 2009, 10:01:30 AM

On Sat, 10 Jan 2009, Pete French wrote:

>> FWIW, the other guy I know who is having this problem had already switched
>> to using ULE under 7.0-release, and did not have any problems with it. So
>> *his* problem was probably not related to SCHED_ULE, unless something has
>> recently changed there.
>

> Well, one of my machines just locked up again, even with SCHED_4BSD on it,
> so I am now thinking it is unrelated.
>
> The machine has completely locked - no response to pings, no response to
> keypresses, nor to the power button. There is nothing printed on the console
> - it is just sitting there with a login prompt :-(
>
> This is really not good - these are extremely common servers after all, and
> I am just running bog standard 7.1 with apache and mysql. This is happening
> across several different servers, all of which are slight variants on the
> DL360, so I don't think it is something peculiar to me.

I'm not sure if you've done this already, but the normal suggestions apply:
have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do
any results / panics / etc result? Sometimes these debugging tools are able
to convert hangs into panics, which gives us much more ability to debug them.
If it still hangs rather than panicking, are you able to break into the
debugger on the console? If you're using a video console and not able to get
to the debugger, would it be possible to configure a serial console and use
that -- serial breaks are often more successful at getting to the debugger
than keyboard breaks. Likewise, I'm not sure if this hardware has an NMI
button -- some HP servers have one on the motherboard that you can press --
but that is also potentially a way to get into the debugger to analyze the
crash.
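
For machines where a serial console is an option, the loader side of the setup suggested above might look like the fragment below. It is a sketch under assumptions: the variable names are the usual loader.conf ones, the speed is an example value, and a login prompt on the serial line additionally needs the matching entry enabled in /etc/ttys. The kernel still needs the debugger options (KDB, DDB and BREAK_TO_DEBUGGER) for a serial break to be of any use.

```sh
# /boot/loader.conf fragment (plain name="value" assignments)

# Send the boot loader and kernel console to the first serial port.
console="comconsole"

# Example line speed; it must match the terminal or console server.
comconsole_speed="115200"
```

With that in place, sending a break over the serial line is the usual way to try to reach the debugger on a box that no longer responds to the keyboard.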

Pete French

Jan 12, 2009, 10:17:31 AM

> I'm not sure if you've done this already, but the normal suggestions apply:
> have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do
> any results / panics / etc result? Sometimes these debugging tools are able
> to convert hangs into panics, which gives us much more ability to debug them.

I did, but it turns out I had an incorrect option in there which made the
data I got not relevant. I now have another machine running a kernel
with the following config:

include GENERIC
ident DEBUG

options KDB
options DDB
options SW_WATCHDOG
options DEBUG_VFS_LOCKS

options MUTEX_DEBUG
options WITNESS
options LOCK_PROFILING
options INVARIANTS
options INVARIANT_SUPPORT
options DIAGNOSTIC

Those should enable me to get some useful output I hope.
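
For reference, building and installing a kernel from a config like the one above usually follows the pattern sketched below. The assumptions are that the sources are in /usr/src and that the config has been saved under the name DEBUG in the architecture's conf directory (for example /usr/src/sys/amd64/conf/DEBUG). The `run` helper and the DRY_RUN switch are purely illustrative: by default the commands are only printed, not executed.

```sh
#!/bin/sh
# Build and install a custom kernel. Commands are printed by default;
# set DRY_RUN=0 in the environment to actually execute them.
kernconf=DEBUG

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"       # dry run: show the command only
    else
        "$@"            # real run: execute it
    fi
}

run cd /usr/src &&
run make buildkernel KERNCONF="$kernconf" &&
run make installkernel KERNCONF="$kernconf"
```

Keeping a known-good kernel available to boot from the loader is worth doing before installing a debug kernel on a remote machine.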

> If it still hangs rather than panicking, are you able to break into the
> debugger on the console? If you're using a video console and not able to get
> to the debugger, would it be possible to configure a serial console and use

I can't add a serial console - I am remote enough from most of
these machines (Slough) and very remote from the test box (it's in the USA!)
so I can't get to them physically. But I do have iLO, which lets me use the
console and gives me a bit of access to the front. I will check for NMI.

Just had another lockup here - my working day has become a succession of
running round rebooting servers through iLO at the moment.

Will get back to you when the debug one has crashed - I could possibly
give you direct access to the iLO console on that if you need it?

-pete.

Garance A Drosihn

Jan 12, 2009, 1:57:01 PM

At 2:55 PM +0000 1/12/09, Robert Watson wrote:
>On Fri, 9 Jan 2009, Garance A Drosihn wrote:
>
>>At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:
>>>On Jan 8, 2009, at 8:58 PM, Pete French wrote:
>>>>I have a number of HP 1U servers, all of which were running 7.0
>>>>perfectly happily. I have been testing 7.1 in its various
>>>>incarnations for the last couple of months on our test server and
>>>>it has performed perfectly.
>>>
>>>I noticed a problem with 7.0 on a couple of Dell servers. [...]
>>>We've since then compiled the kernel under the BSD scheduler to
>>>rule that out, and so far so good.
>>>
>>>Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?
>>

>>FWIW, the other guy I know who is having this problem had already
>>switched to using ULE under 7.0-release, and did not have any
>>problems with it. So *his* problem was probably not related to
>>SCHED_ULE, unless something has recently changed there.
>>

>>Turns out he hasn't reverted back to 7.0-release just yet, so he's
>>going to try SCHED_4BSD and see if that helps his situation.
>
>Scheduler changes always come with some risk of exposing bugs that
>have existed in the code for a long time but never really manifested
>themselves. ULE is well shaken-out, having been under development
>for at least five years, but it is possible that some problems will
>become visible as a result of the switch. I would encourage people
>to stick with ULE, but if you're having a stability problem then
>experimenting with scheduler as a variable that could be triggering
>the problem may well be useful to help track down the bug.

Just to followup on this: My friend did switch back to a 7.1 kernel with
SCHED_4BSD, and he still ran into problems. The error messages weren't
the same, but errors did happen in the same high disk-I/O situations as
the lockup happened with SCHED_ULE. At this point he's fallen back to
the 7.0-kernel that he had been running (which also has SCHED_ULE), and
all the problems have gone away. So at the moment he's running with a
7.0-ish kernel and the 7.1-release userland, without the hanging problems.
So the problem is something in the kernel, but it is *NOT* the scheduler
(at least, not in his case).

He is not eager to do a whole lot of experiments to track down the
problem, since this is happening on busy production machines and he
can't afford to have a lot of downtime on them (especially now that the
semester at RPI has started up). The systems have some large (2 TB)
filesystems on them, and the lockups occur in high disk-I/O situations.
He's seeing the problem on one system which is a dual CPU quad-core
xeon, and another which is a 64 bit P4 with hyperthreading. The one
thing in common between the two setups is that the boot drives + a
3ware controller (with its array of RAID disks) is moved from one
machine to the other one:

"it's a 3ware 9500 12 port model, the boot drive is connected to
an ICH6 in IDE mode, and yes, I've run it in single, single with
hyper threading, and 8 way mode. All 64 bit."

We still have no idea where the problem really is. For all we know,
someone spilled a Pepsi on it when he wasn't looking...

--
Garance Alistair Drosehn = g...@gilead.netel.rpi.edu
Senior Systems Programmer or g...@freebsd.org
Rensselaer Polytechnic Institute or dro...@rpi.edu

Pete French

Jan 12, 2009, 2:01:11 PM

> I'm not sure if you've done this already, but the normal suggestions apply:
> have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do
> any results / panics / etc result? Sometimes these debugging tools are able
> to convert hangs into panics, which gives us much more ability to debug them.

OK, I have now had a machine hang again, with the correct debug options in
the kernel. The screen looked like this when I went to restart it:

http://toybox.twisted.org.uk/~pete/71_lor2.png

It had not, however, dropped into any kind of debugger. Also there appear
to be console messages after the lock order reversal - is that normal?

The machine did stay up for a significant amount of time before doing this. I
notice that it is more or less identical to the one I posted when I
had WITNESS_KDB in the kernel too, so maybe those results aren't
entirely spurious after all?

Given it hasn't dropped to a debugger, is there anything else I can try?

-pete.

Pete French

Jan 12, 2009, 2:02:25 PM

> Just to followup on this: My friend did switch back to a 7.1 kernel with
> SCHED_4BSD, and he still ran into problems. The error messages weren't

Actually, I don't know if I posted it, but that was the same for me too.
The scheduler makes no difference, nor do CPU compile settings.

Tomas Randa

Jan 12, 2009, 4:04:52 PM

Hello,

I have similar problems. The last "good" kernel I have is from the stable
branch, October 8. Then in the next upgrade, I saw big problems with
performance.
I tried ULE, 4BSD etc., but nothing helps, only downgrading the system.

Now I am trying 7.1-p1 and the problems are here again. MySQL spends a
lot of time in the status "waiting for opening table" or "waiting for
close tables".

I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard, and
an Areca SATA controller. Could the problem be in the "da" device, for example?

Thanks Tomas Randa

> Just to followup on this: My friend did switch back to a 7.1 kernel with
> SCHED_4BSD, and he still ran into problems. The error messages weren't
> the same, but errors did happen in the same high disk-I/O situations as
> the lockup happened with SCHED_ULE. At this point he's fallen back to
> the 7.0-kernel that he had been running (which also has SCHED_ULE), and
> all the problems have gone away. So at the moment he's running with a
> 7.0-ish kernel and the 7.1-release userland, without the hanging
> problems.
> So the problem is something in the kernel, but it is *NOT* the scheduler
> (at least, not in his case).
>
> He is not eager to do a whole lot of experiments to track down the
> problem, since this is happening on busy production machines and he
> can't afford to have a lot of downtime on them (especially now that the
> semester at RPI has started up). The systems have some large (2 TB)
> filesystems on them, and the lockups occur in high disk-I/O situations.
> He's seeing the problem on one system which is a dual CPU quad-core
> xeon, and another which is a 64 bit P4 with hyperthreading. The one
> thing in common between the two setups is that the boot drives + a
> 3ware controller (with its array of RAID disks) is moved from one
> machine to the other one:
>
> "its a 3ware 9500 12 port model, the boot drive is connected to
> an ICH6 in IDE mode, and yes, I've run it in single, single with
> hyper threading, and 8 way mode. All 64 bit."
>
> We still have no idea where the problem really is. For all we know,
> someone spilled a Pepsi on it when he wasn't looking...
>

Claus Guttesen

unread,

Jan 12, 2009, 4:33:41 PM1/12/09

to

> I have similar problems. The last "good" kernel I have is from the stable
> branch, October 8. Then, after the next upgrade, I saw big problems with
> performance. I tried ULE, 4BSD etc., but nothing helps except downgrading
> the system.
>
> Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot
> of time waiting with status "waiting for opening table" or "waiting for
> close tables".
>
> I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and
> an Areca SATA controller. Could the problem be in the "da" device, for
> example?

It was mentioned previously in this thread that CPUTYPE could be an
issue. Did you change this if you customized your kernel?

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Robert Watson

unread,

Jan 12, 2009, 7:05:33 PM1/12/09

to

On Mon, 12 Jan 2009, Tomas Randa wrote:

> I have similar problems. The last "good" kernel I have is from the stable
> branch, October 8. Then, after the next upgrade, I saw big problems with
> performance. I tried ULE, 4BSD etc., but nothing helps except downgrading
> the system.
>
> Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot
> of time waiting with status "waiting for opening table" or "waiting for
> close tables".
>
> I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and
> an Areca SATA controller. Could the problem be in the "da" device, for
> example?

So far, this sounds like a different problem than the one others have been
posting about, which involves full system freezes rather than specific
processes wedging or responding poorly. I'd suggest starting by using
"procstat -k" on the process ID to look at where specific threads are waiting
in the kernel. Is it simply that MySQL is being unreasonably slow in certain
situations, or does it actually entirely stop operating?
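
As a rough sketch of the "procstat -k" suggestion: the helper below loops over every process with a given name and prints the kernel stacks of its threads. The helper name kstacks and the process name mysqld are assumptions made for illustration; pgrep and "procstat -k" are the only real commands used.

```shell
# kstacks NAME
# Print the kernel thread stacks of every process whose name is exactly NAME.
# Hypothetical helper; run it as root while the process appears stuck.
kstacks() {
    for pid in $(pgrep -x "$1"); do
        procstat -k "$pid"
    done
}
```

Calling "kstacks mysqld" a few times, some seconds apart, would show whether the threads stay parked in the same kernel function (wedged) or keep moving (merely slow).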

If you're able to narrow down the date on the 7.x branch where the problem
you're experiencing "begins", that would be most helpful. I'd suggest leaving
your userspace on the 8th october, and sliding the kernel forward in a binary
search until you've narrowed it down a bit. Obviously, this takes a bit of
patience, but narrowing it down could be quite informative.
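
The binary search described above can be sketched as follows. This is only an illustration of the bookkeeping: dates are represented as day offsets from the last known-good kernel, and the name first_bad and the test command passed to it are hypothetical. In practice the "test" is a manual step (check out the kernel source for that date, build, boot, and watch for the problem).

```shell
# first_bad LO HI TEST_CMD
# LO is a day offset known to be good, HI a day offset known to be bad.
# TEST_CMD is called with a day offset and must return 0 if the kernel
# from that day shows the problem, non-zero otherwise.
# Prints the earliest day offset at which the problem appears.
first_bad() {
    lo=$1
    hi=$2
    test_cmd=$3
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$test_cmd" "$mid"; then
            hi=$mid    # problem present: first bad day is mid or earlier
        else
            lo=$mid    # problem absent: first bad day is after mid
        fi
    done
    echo "$hi"
}
```

With a window of about 90 days this needs roughly seven kernel builds, instead of one per day.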

Robert N M Watson
Computer Laboratory
University of Cambridge


Robert Watson

unread,

Jan 12, 2009, 7:10:58 PM1/12/09

to

On Mon, 12 Jan 2009, Pete French wrote:

>> I'm not sure if you've done this already, but the normal suggestions apply:
>> have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do
>> any results / panics / etc result? Sometimes these debugging tools are
>> able to convert hangs into panics, which gives us much more ability to
>> debug them.
>
> OK, I have now had a machine hang again, with the correct debug options in
> the kernel. The screen looked like this when I went to restart it:
>
> http://toybox.twisted.org.uk/~pete/71_lor2.png
>
> It had not, however, dropped into any kind of debugger. Also there appear to
> be console messages after the lock order reversal - is that normal ?

Lock order reversals are warnings of potential deadlock due to a lock cycle,
but deadlocks may not actually result, either because it's a false positive
(some locking construct that is deadlock free but involves lock cycles), or
because a cycle didn't actually form. The message is suggestive, but if you
have significant system activity after the message, then it may be unrelated.

> The machine did stay up for a significant amount of time before doing this. I
> notice that it is more or less identical to the one I posted when I had
> WITNESS_KDB in the kernel too, so maybe those results aren't entirely
> spurious after all ?
>
> Given it hasn't dropped to a debugger, is there anything else I can try ?

Features like WITNESS and INVARIANTS may change the timing of the kernel
making certain race conditions less likely; I'd run with them for a bit and
see if you can reproduce the hang with them present, as they will make
debugging the problem a lot easier, if it's possible.

Robert N M Watson
Computer Laboratory
University of Cambridge

Robert Watson

unread,

Jan 12, 2009, 7:13:59 PM1/12/09

to

On Mon, 12 Jan 2009, Garance A Drosihn wrote:

> He is not eager to do a whole lot of experiments to track down the problem,
> since this is happening on busy production machines and he can't afford to
> have a lot of downtime on them (especially now that the semester at RPI has
> started up). The systems have some large (2 TB) filesystems on them, and
> the lockups occur in high disk-I/O situations. He's seeing the problem on
> one system which is a dual CPU quad-core xeon, and another which is a 64 bit
> P4 with hyperthreading. The one thing in common between the two setups is
> that the boot drives + a 3ware controller (with its array of RAID disks) is
> moved from one machine to the other one:

I think playing the combinatorics game on compile-time flags, kernel features,
etc, is probably not the best way to go about debugging this. Instead, I'd
debug this as a kernel hang by breaking into the debugger once it occurs, if
possible, and ideally on a serial console. Often times hangs can be debugged
looking solely at DDB output, or if possible, a crash dump.

Doug Barton

unread,

Jan 13, 2009, 4:12:35 AM1/13/09

to

Pete French wrote:
> Mine never lock up doing buildworlds either. They only lock up when they are
> sitting there more or less idle! The machines which have never locked up
> are the webservers, which are fairly heavily loaded. The machine which locks
> up the most frequently is a box sitting there doing nothing but DNS, which is
> the most lightly loaded of the lot.

Silly question but do you have powerd enabled on that server? If so,
does disabling it help? Also do you have any of these in /etc/rc.conf
(i.e., they are not the same as the default values in
/etc/defaults/rc.conf):
performance_cx_lowest="HIGH" # Online CPU idle state
performance_cpu_freq="NONE" # Online CPU frequency
economy_cx_lowest="HIGH" # Offline CPU idle state
economy_cpu_freq="NONE" # Offline CPU frequency
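
A quick way to answer the question above is to list any power-management overrides in rc.conf. This is a minimal sketch; the function name rc_power_overrides is made up for illustration, and the variable names are the ones listed above plus powerd_enable.

```shell
# rc_power_overrides [FILE]
# Print any powerd / CPU idle-state settings found in FILE
# (default /etc/rc.conf). No output means the defaults from
# /etc/defaults/rc.conf are in effect.
rc_power_overrides() {
    grep -E '^(powerd_enable|performance_cx_lowest|performance_cpu_freq|economy_cx_lowest|economy_cpu_freq)=' "${1:-/etc/rc.conf}"
}
```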

Doug

--

This .signature sanitized for your protection

Claus Guttesen

unread,

Jan 13, 2009, 4:52:01 AM1/13/09

to

>> Mine never lock up doing buildworlds either. They only lock up when they are
>> sitting there more or less idle! The machines which have never locked up
>> are the webservers, which are fairly heavily loaded. The machine which locks
>> up the most frequently is a box sitting there doing nothing but DNS, which is
>> the most lightly loaded of the lot.

The server has been idle for a day now and is up and running. I have
then copied a file to generate some i/o and it copies without
problems.

for ((a=0;a<10;a++))
do
cp netbeans-6.5-ml-macosx.dmg ${a}.dmg &
done

I can't (fortunately) make it lock up. I have a DL360 G5 which is
unused atm. and can test on it if needed.

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Gavin Atkinson

unread,

Jan 13, 2009, 5:32:09 AM1/13/09

to

On Mon, 2009-01-12 at 19:00 +0000, Pete French wrote:
> > I'm not sure if you've done this already, but the normal suggestions apply:
> > have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do
> > any results / panics / etc result? Sometimes these debugging tools are able
> > to convert hangs into panics, which gives us much more ability to debug them.
>
> OK, I have now had a machine hang again, with the correct debug options in
> the kernel. The screen looked like this when I went to restart it:
>
> http://toybox.twisted.org.uk/~pete/71_lor2.png
>
> It had not, however, dropped into any kind of debugger. Also there appear
> to be console messages after the lock order reversal - is that normal ?
>
> The machine did stay up for a significant amount of time before doing this. I
> notice that it is more or less identical to the one I posted when I
> had WITNESS_KDB in the kernel too, so maybe those results aren't
> entirely spurious after all ?
>
> Given it hasn't dropped to a debugger, is there anything else I can try ?

Can you break into the debugger with Ctrl-Alt-Esc, or by sending a break
over the serial line?

Gavin

Pete French

unread,

Jan 13, 2009, 6:47:00 AM1/13/09

to

> Lock order reversals are warnings of potential deadlock due to a lock cycle,
> but deadlocks may not actually result, either because it's a false positive
> (some locking construct that is deadlock free but involves lock cycles), or
> because a cycle didn't actually form. The message is suggestive, but if you
> have significant system activity after the message, then it may be unrelated.

It's hard to tell in this case as there are no timestamps, so I can't
see if there is any activity after the lockup.

> Features like WITNESS and INVARIANTS may change the timing of the kernel
> making certain race conditions less likely; I'd run with them for a bit and
> see if you can reproduce the hang with them present, as they will make
> debugging the problem a lot easier, if it's possible.

Uh, the above *was* me reproducing the hang with them present ;-)) It
quite happily hangs with those things in the kernel - indeed the next
hang was immediately after I rebooted the machine. But even with WITNESS
and INVARIANTS and all the rest it does not drop to a debugger, it
simply locks up.

That machine is currently turned off, but still has 7.1 installed. What
would you like me to try now ? I have a lockup I can reproduce pretty
reliably now (just wait and it will always lock up). I also found that
my other 7.1 box locks up fairly reliably when doing a buildworld.

The only difference between these two machines and the ones which don't
lock up is that these are serving DNS. The others don't. Note that all
the hardware is identical, as is the installed software and the configuration.

I am at a total loss...

-pete.

Pete French

unread,

Jan 13, 2009, 6:51:02 AM1/13/09

to

> It was mentioned previous in this thread that CPUTYPE could be an
> issue. Did you change this if you customized your kernel?

Actually, I think that's been ruled out as a possible cause, along
with the scheduler. Certainly I have tried it both ways and
there is no difference, and I think I saw that the others had too.

Robert Watson

unread,

Jan 13, 2009, 7:44:00 AM1/13/09

to

On Tue, 13 Jan 2009, Pete French wrote:

>> Features like WITNESS and INVARIANTS may change the timing of the kernel
>> making certain race conditions less likely; I'd run with them for a bit and
>> see if you can reproduce the hang with them present, as they will make
>> debugging the problem a lot easier, if it's possible.
>
> Uh, the above *was* me reproducing the hang with them present ;-)) It quite
> happily hangs with those things in the kernel - indeed the next hang was
> immediately after I rebooted the machine. But even with WITNESS and
> INVARIANTS and all the rest it does not drop to a debugger, it simply locks
> up.
>
> That machine is currently turned off, but still has 7.1 installed. What
> would you like me to try now ? I have a lockup I can reproduce pretty
> reliably now (just wait and it will always lock up). I also found that my
> other 7.1 box locks up fairly reliably when doing a buildworld.
>
> The only difference between these two machines and the ones which don't lock
> up is that these are serving DNS. The others don't. Note that all the
> hardware is identical, as is the installed software and the configuration.

If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing
ctrl-alt-break on the console to see if you can drop into the debugger, or
issue a serial break on a serial console. For reasons that are somewhat
complicated to explain, serial breaks are more effective at getting into the
debugger, so are preferable -- also because you can more easily log output
from the debugger.

If you are able to get into the debugger, the normal commands would be most
helpful, especially if you can log the results:

ps
show lockedvnods
show alllocks

Robert N M Watson
Computer Laboratory
University of Cambridge

Pete French

unread,

Jan 13, 2009, 8:44:40 AM1/13/09

to

> Can you break into the debugger with Ctrl-Alt-Esc, or by sending a break
> over the serial line?

No, ctrl-alt-esc doesn't work, and there is no serial line on the machine (not
that I can access anyway)

-pete.

Pete French

unread,

Jan 13, 2009, 9:12:26 AM1/13/09

to

> Silly question but do you have powerd enabled on that server? If so,
> does disabling it help? Also do you have any of these in /etc/rc.conf
> (i.e., they are not the same as the default values in
> /etc/defaults/rc.conf):
> performance_cx_lowest="HIGH" # Online CPU idle state
> performance_cpu_freq="NONE" # Online CPU frequency
> economy_cx_lowest="HIGH" # Offline CPU idle state
> economy_cpu_freq="NONE" # Offline CPU frequency

No, none of those. My rc.conf is below. The only slightly unusual thing I
am doing is using lagg rather than the interfaces directly I guess, but
that has worked fine for ages.

-pete.

hostname="florentine.rattatosk"
cloned_interfaces="lagg0"
network_interfaces="lo0 bce0 bce1 lagg0"
ifconfig_bce0="up"
ifconfig_bce1="up"
ifconfig_lagg0="laggproto lacp laggport bce0 laggport bce1"

ipv4_addrs_lagg0="10.48.19.0/16 10.48.19.229/16 10.48.19.223/16 10.48.19.243/16 10.48.19.226/16
10.48.19.224/16 10.48.19.227/16 10.48.19.239/16 10.48.19.225/16 10.48.19.230/16 10.48.19.232/16
10.48.19.228/16 10.48.19.235/16 10.48.19.244/16 10.48.19.245/16"

defaultrouter="10.48.0.9"

inetd_enable="YES"
sshd_enable="YES"

dhcpd_enable="YES"
dhcpd_ifaces="lagg0"
dhcpd_flags="-q"
dhcpd_conf="/usr/local/etc/dhcpd.conf"
dhcpd_withumask="022"

nfs_client_enable="YES"
nfs_server_enable="YES"
portmap_enable="YES"
rpcbind_enable="YES"

named_enable="YES"
pdns_enable="YES"
pdns_recursor_enable="NO"

mysql_enable="YES"

apache22_http_accept_enable="YES"
apache22_enable="YES"

ntpd_enable="YES"
ntpd_sync_on_start="YES"

exim_enable="YES"
exim_flags="-bd -q10m"
sendmail_enable="NONE"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"

Pete French

unread,

Jan 13, 2009, 10:18:38 AM1/13/09

to

> I can't (fortunately) make it lock up. I have a DL360 G5 which is
> unused atm. and can test on it if needed.

Would it be possible to install that under amd64 and hammer it with
DNS requests ? I have been trying to think what the difference might be
between my webservers and the machines which are freezing, and the only
one I can come up with is UDP traffic as the locking machines are serving
DNS and also NFS.

-pete.

Nathan Way

unread,

Jan 13, 2009, 10:32:33 AM1/13/09

to

I also am experiencing lock-ups on a server recently upgraded from
7.0-RELEASE to 7.1-STABLE. This server is a Supermicro 6022 dual-Xeon
box running a GENERIC i386 SMP kernel. Since upgrading to 7.1-STABLE it
has started locking up daily. I see symptoms similar to those Pete is
seeing - no ping response, no keyboard response, no video output on a
very lightly loaded server.

I have a test machine with duplicate hardware to the one locking up that
I just finished installing 7.1-STABLE on but so far it hasn't locked up.
Coincidentally my locking machine is also a DNS server but I have not
enabled DNS on my test machine yet.

Since the locking server is remote to me, I need to downgrade it to 7.0
to get it stable again. Once I finish that process, I can provide
remote access to the 7.1-STABLE machine in my office if anyone would
like to test with it.

Robert Watson

unread,

Jan 13, 2009, 11:12:11 AM1/13/09

to

On Tue, 13 Jan 2009, Pete French wrote:

>> I can't (fortunately) make it lock up. I have a DL360 G5 which is unused
>> atm. and can test on it if needed.
>
> Would it be possible to install that under amd64 and hammer it with DNS
> requests ? I have been trying to think what the difference might be between
> my webservers and the machines which are freezing, and the only one I can
> come up with is UDP traffic as the locking machines are serving DNS and also
> NFS.

There are significant changes in UDP locking between 7.0 and 7.1, so it could
be that we're looking at a regression there. If you're able to reproduce this
reliably, it might well be worth doing a little search-and-replace in
udp_usrreq.c along the following lines:

INP_RLOCK_ASSERT -> INP_WLOCK_ASSERT
INP_RLOCK -> INP_WLOCK
INP_RUNLOCK -> INP_WUNLOCK

However, before making these changes for debugging purposes, make sure it's
100% reproducible without them in the configuration so that we don't find
ourselves barking up the wrong tree. Normally deadlocks along these lines
*do* allow breaking into the debugger from a serial console, but since there
are significant changes here in 7.1 it is worth trying to see if this might be
related.
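
The search-and-replace above could be scripted roughly like this. It is a sketch only: the function name rlock_to_wlock is invented here, and the filter works on any text fed to it, so keep a copy of the original udp_usrreq.c before overwriting anything.

```shell
# rlock_to_wlock
# Read C source on stdin and write it to stdout with the inpcb read-lock
# macros replaced by their write-lock equivalents.
rlock_to_wlock() {
    # The second expression also rewrites INP_RLOCK_ASSERT,
    # because INP_RLOCK is a prefix of it.
    sed -e 's/INP_RUNLOCK/INP_WUNLOCK/g' \
        -e 's/INP_RLOCK/INP_WLOCK/g'
}
```

For example, "rlock_to_wlock < udp_usrreq.c > udp_usrreq.c.wlock", followed by a diff against the original to check that only the intended macros changed.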

Robert N M Watson
Computer Laboratory
University of Cambridge

Ken Smith

unread,

Jan 13, 2009, 1:57:24 PM1/13/09

to


On Mon, 2009-01-12 at 21:35 +0100, Tomas Randa wrote:
> I have similar problems. The last "good" kernel I have is from the stable
> branch, October 8. Then, after the next upgrade, I saw big problems with
> performance.
> I tried ULE, 4BSD etc., but nothing helps except downgrading the system.
>
> Now I am trying 7.1-p1 and the problems are here again. MySQL spends a
> lot of time waiting with status "waiting for opening table" or "waiting for
> close tables".
>
> I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and
> an Areca SATA controller. Could the problem be in the "da" device, for
> example?
>
> Thanks Tomas Randa

Could you give r186860 a try? It is an MFC into stable/7 so if the
machine in question is something you can experiment with just updating
to stable/7 would take care of it. Otherwise if you could just manually
apply the patch to a 7.1 source tree and do a test build of the kernel
that would also do it.

I'm not experiencing lockups but this patch helped a lot on a machine I
have with a particular disk I/O pattern that resulted in extremely poor
performance with 7.1-RELEASE. This patch brought it back to its normal
performance level.

Thanks.

--
Ken Smith
- From there to here, from here to | kens...@cse.buffalo.edu
there, funny things are everywhere. |
- Theodore Geisel |


Pete French

unread,

Jan 14, 2009, 8:27:04 AM1/14/09

to

> If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing
> ctrl-alt-break on the console to see if you can drop into the debugger, or
> issue a serial break on a serial console.

Well, I added BREAK_TO_DEBUGGER to the kernel config I had which contained
all the other stuff (WITNESS etc...). The end result...

...it no longer crashes :-(

I am not sure what to make of that! What could adding this to the kernel
possibly do which would make my problems go away ? Should I try just
adding this option to my GENERIC kernel and seeing if that also gives me
something stable ?

-pete.

Robert Watson

unread,

Jan 14, 2009, 8:43:25 AM1/14/09

to

On Wed, 14 Jan 2009, Pete French wrote:

>> If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing
>> ctrl-alt-break on the console to see if you can drop into the debugger, or
>> issue a serial break on a serial console.
>
> Well, I added BREAK_TO_DEBUGGER to the kernel config I had which contained
> all the other stuff (WITNESS etc...). The end result...
>
> ...it no longer crashes :-(
>
> I am not sure what to make of that! What could adding this to the kernel
> possibly do which would make my problems go away ? Should I try just adding
> this option to my GENERIC kernel and seeing if that also gives me something
> stable ?

Yeah, that is unexpected -- the BREAK_TO_DEBUGGER path should have almost no
effect on control flow, unlike, say, WITNESS, which significantly distorts
timing. Is there any chance you picked up any of the recent fixes that went
into RELENG_7 without noticing, and that perhaps one of those did it? With
regard to what to do: if you didn't pick up a fix without noticing, yeah, I
think it's worth testing the hypothesis that BREAK_TO_DEBUGGER fixed (or at
least, masked) the problem. Generally with this sort of testing one has to be
pretty rigorous in testing assumptions, because it's easy for changes to sneak
in. Particularly annoying are seemingly innocuous code changes that do things
like slightly rearrange kernel memory.

FWIW, I suspect the various reports we are seeing reflect more than one
problem, and that they must be relatively edge-case individually but reports
of a few problems have led to more "coming out of the woodwork". Obviously,
the problems are not edge-case to the people experiencing them...

Robert N M Watson
Computer Laboratory
University of Cambridge

Pete French

unread,

Jan 14, 2009, 9:17:25 AM1/14/09

to

> effect on control flow, unlike, say, WITNESS, which significantly distorts
> timing. Is there any chance you picked up any of the recent fixes that went
> into RELENG_7 without noticing, and that perhaps one of those did it? With

I'm pretty certain I didn't - I have just been changing kernel config
files, I haven't actually csup'd at all.

> regard to what to do: if you didn't pick up a fix without noticing, yeah, I
> think it's worth testing the hypothesis that BREAK_TO_DEBUGGER fixed (or at
> least, masked) the problem.

OK. I think I need at least 4 kernels to try here: GENERIC (which should
show the problem), my original DEBUG (which also shows the problem) plus
both of those with BREAK_TO_DEBUGGER included to see if that fixes it. Can
I just add BREAK_TO_DEBUGGER on its own to a config file ? I was wondering
if I need to include one of the other debugger options so that it has
something to break to ?

> FWIW, I suspect the various reports we are seeing reflect more than one
> problem, and that they must be relatively edge-case individually but reports
> of a few problems have led to more "coming out of the woodwork". Obviously,
> the problems are not edge-case to the people experiencing them...

I was thinking that too - I've been guilty of this in the past too, lumping
my problem in with others under the assumption that it's all the same. This
is obviously pretty rare - out of 24 of the HP servers the problem only crops
up on 4 of them. But there is nothing different about those 4.

I will let you know what my various kernel compiles give me - am building
again from scratch, which is slow with WITNESS enabled.

-pete.

Claus Guttesen

unread,

Jan 14, 2009, 12:23:28 PM1/14/09

to

> my problem in with others under the assumption that it's all the same. This
> is obviously pretty rare - out of 24 of the HP servers the problem only crops
> up on 4 of them. But there is nothing different about those 4.

Could it be different bios/firmware on the hp-servers?

Mr. Aliyev was unable to install 7.1 release on amd64 on a DL380 G5.

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Pete French

unread,

Jan 15, 2009, 10:36:06 AM1/15/09

to

Just an update on this - I tried the various kernels, but now the machine is
not locking up at all. As I haven't actually changed anything then this does
not make me as happy as you might expect. I don't know what to do now - I
dare not upgrade the machines to an OS that I know locks, but if I can't
make it lock then it is impossible to get any useful debugging info out
of it.

Maybe waiting for 7.2 is the best move...

-pete.

Robert Watson

unread,

Jan 15, 2009, 10:52:25 AM1/15/09

to

On Thu, 15 Jan 2009, Pete French wrote:

> Just an update on this - I tried the various kernels, but now the machine is
> not locking up at all. As I haven't actually changed anything then this does
> not make me as happy as you might expect. I don't know what to do now - I
> dare not upgrade the machines to an OS that I know locks, but if I can't
> make it lock then it is impossible to get any useful debugging info out of it.
> Maybe waiting for 7.2 is the best move...

Well, one slightly pessimistic (or realistic) view says that all software
contains bugs, it's just a question of whether or not your workload and
environment trigger those bugs in a noticeable way.

Given the inconsistency of the symptoms, I wouldn't preclude something
environmental: could it be that it was the bottom, or more likely, top box in
a rack and that your air conditioning isn't quite as effective there when the
outside temperature is above/below some threshold? Alternatively, could it be
that the workload changed very slightly -- you're doing less DNS queries, or
the network latency to the DNS server changed?

Certainly, whoever gave the advice on checking BIOS revisions is right: you
can spend a lot of time tracking down a bug to realize that one box has a
slightly different BIOS rev and therefore does/doesn't suffer from an obscure
SMI bug.

In any case, if it starts to reproducibly recur, send out mail and we can see
if we can track it down some more. BTW, did you establish if the version of
iLO you have has a remote NMI? I seem to recall that some do, and being able
to deliver an NMI is really quite valuable.

Robert N M Watson
Computer Laboratory
University of Cambridge

Pete French

unread,

Jan 15, 2009, 11:00:31 AM1/15/09

to

> Given the inconsistency of the symptoms, I wouldn't preclude something
> environmental: could it be that it was the bottom, or more likely, top box in
> a rack and that your air conditioning isn't quite as effective there when the
> outside temperature is above/below some threshold?

It's a possibility - but the two machines which were exhibiting the fault
are in Slough and Baton Rouge respectively, so under very different climatic
conditions. However, something has changed to make it stop locking up!
The USA one was doing it every couple of hours at the start of the week, and
the UK one wouldn't last more than half an hour at one point.

> Alternatively, could it be that the workload changed very slightly -- you're
> doing less DNS queries, or the network latency to the DNS server changed?

Also a possibility - that workload is entirely dependent on customer behaviour
which is an unpredictable beast!

> Certainly, whoever gave the advice on checking BIOS revisions is right: you
> can spend a lot of time tracking down a bug to realize that one box has a
> slightly different BIOS rev and therefore does/doesn't suffer from an obscure
> SMI bug.

Yes, that's next on my list - make sure they are all on the same version.

> In any case, if it starts to reproducibly recur, send out mail and we can see
> if we can track it down some more. BTW, did you establish if the version of
> iLO you have has a remote NMI? I seem to recall that some do, and being able
> to deliver an NMI is really quite valuable.

OK, thanks. My iLO2 appears to have the ability to generate an NMI on
demand, so that could be used if/when the fault crops up again.

thanks, will let this lie for now and resurrect the thread when I can
get some more useful data.

-pete.

Robert Watson

unread,

Jan 15, 2009, 11:05:31 AM1/15/09

to

On Thu, 15 Jan 2009, Pete French wrote:

>> In any case, if it starts to reproducibly recur, send out mail and we can
>> see if we can track it down some more. BTW, did you establish if the
>> version of iLO you have has a remote NMI? I seem to recall that some do,
>> and being able to deliver an NMI is really quite valuable.
>
> OK, thanks. My iLO2 appears to have the ability to generate an NMI on
> demand, so that could be used if/when the fault crops up again.
>
> thanks, will let this lie for now and resurrect the thread when I can get
> some more useful data.

Excellent WRT NMI. As long as you have DDB, KDB, and BREAK_TO_DEBUGGER
compiled into the kernel, generating that should reliably get you into the
debugger. If it's possible to keep running with INVARIANTS and WITNESS, or
just INVARIANTS if WITNESS slows things down too much, that would be
desirable. You might want to give the NMI a test run just to make sure it
behaves as you think it should, though -- be aware that if DDB/KDB aren't
compiled into the kernel, then an NMI will panic the box.
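
One way to prepare such a kernel is sketched below, under assumptions: the function name make_debug_config and the file names are illustrative, and the option list is simply the set named in this thread plus INVARIANT_SUPPORT, which INVARIANTS requires. If the source config already sets any of these options, drop the duplicates by hand, and adjust the ident line to taste.

```shell
# make_debug_config SRC DST
# Copy kernel config SRC to DST and append the debugging options
# discussed in this thread. SRC is left untouched.
make_debug_config() {
    src=$1
    dst=$2
    cp "$src" "$dst" || return 1
    cat >> "$dst" <<'EOF'

# Debugging options appended for hang investigation
options 	KDB
options 	DDB
options 	BREAK_TO_DEBUGGER
options 	INVARIANTS
options 	INVARIANT_SUPPORT
options 	WITNESS
EOF
}
```

For instance, "make_debug_config GENERIC DEBUG" in the kernel conf directory, then build with KERNCONF=DEBUG. Leaving out the WITNESS line gives the lighter INVARIANTS-only variant mentioned above.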

Robert N M Watson
Computer Laboratory
University of Cambridge

Pete French

unread,

Jan 15, 2009, 12:20:09 PM1/15/09

to

> desirable. You might want to give the NMI a test run just to make sure it
> behaves as you think it should, though -- be aware that if DDB/KDB aren't
> compiled into the kernel, then an NMI will panic the box.

Unfortunately it does this...

http://toybox.twisted.org.uk/~pete/71_nmi1.png

That is locked up too - hitting return does nothing. I was hoping it
was just garbled output but had actually gone to the debugger.
Apparently not.

That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER,
which does work as I have tested it with CTRL_ALT_ESC.

Mmmmm....

-pete.

Robert Watson

unread,

Jan 15, 2009, 12:50:01 PM1/15/09

to

On Thu, 15 Jan 2009, Pete French wrote:

>> desirable. You might want to give the NMI a test run just to make sure it
>> behaves as you think it should, though -- be aware that if DDB/KDB aren't
>> compiled into the kernel, then an NMI will panic the box.
>
> Unfortunately it does this...
>
> http://toybox.twisted.org.uk/~pete/71_nmi1.png
>
> That is locked up too - hitting return does nothing. I was hoping it was
> just garbled output but had actually gone to the debugger. Apparently not.
>
> That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which
> does work as I have tested it with CTRL_ALT_ESC.

Er, that's rather upsetting. John, do you have any ideas about this?

Robert N M Watson
Computer Laboratory
University of Cambridge

John Baldwin

Jan 15, 2009, 5:51:27 PM

On Thursday 15 January 2009 12:49:11 pm Robert Watson wrote:
> On Thu, 15 Jan 2009, Pete French wrote:
>
> >> desirable. You might want to give the NMI a test run just to make sure it
> >> behaves as you think it should, though -- be aware that if DDB/KDB aren't
> >> compiled into the kernel, then an NMI will panic the box.
> >
> > Unfortunately it does this...
> >
> > http://toybox.twisted.org.uk/~pete/71_nmi1.png
> >
> > That is locked up too - hitting return does nothing. I was hoping it was
> > just garbled output but had actually gone to the debugger. Apparently not.
> >
> > That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which
> > does work as I have tested it with CTRL_ALT_ESC.
>
> Er, that's rather upsetting. John, do you have any ideas about this?

I still have no context on the rest of the thread. I think the garbage is
due to competing panics. The problem is that we don't single-thread the
printfs in 'trap_fatal()'. We should probably have some sort of simple spin
lock in the x86 code so that only one CPU at a time can run through that
routine.

--
John Baldwin
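
The spin lock idea suggested above can be sketched in userspace C. This is only an illustration, not FreeBSD's trap code: the names report_lock, fatal_report and run_reports are hypothetical, threads stand in for CPUs, and a C11 atomic_flag stands in for whatever primitive the kernel would actually use. Each simulated CPU takes the lock before printing, so the lines of one report are never interleaved with another's.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define MAX_CPUS 8

/* Hypothetical names throughout; this is not FreeBSD's trap code. */
static atomic_flag report_lock = ATOMIC_FLAG_INIT;
static int reports_done;        /* only touched while report_lock is held */

static void report_lock_acquire(void)
{
    /* Spin until the flag was previously clear, i.e. we now own it. */
    while (atomic_flag_test_and_set_explicit(&report_lock,
                                             memory_order_acquire))
        ;
}

static void report_lock_release(void)
{
    atomic_flag_clear_explicit(&report_lock, memory_order_release);
}

/* Stand-in for a fatal-trap report: printf calls that must stay together. */
static void *fatal_report(void *arg)
{
    int cpu = *(int *)arg;

    report_lock_acquire();
    printf("cpu %d: fatal trap (simulated)\n", cpu);
    printf("cpu %d: end of report\n", cpu);
    reports_done++;
    report_lock_release();
    return NULL;
}

/* Start ncpu threads that all report at once; return completed reports. */
static int run_reports(int ncpu)
{
    pthread_t threads[MAX_CPUS];
    int ids[MAX_CPUS];

    if (ncpu > MAX_CPUS)
        ncpu = MAX_CPUS;
    reports_done = 0;
    for (int i = 0; i < ncpu; i++) {
        ids[i] = i;
        int rc = pthread_create(&threads[i], NULL, fatal_report, &ids[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < ncpu; i++)
        pthread_join(threads[i], NULL);
    return reports_done;
}
```

Calling run_reports(4) from a main() prints four two-line reports in whatever order the threads win the lock. A real kernel version would also have to cope with a CPU that faults again while holding the lock, which this sketch ignores.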

Pete French

Jan 16, 2009, 7:20:58 AM

> If you are able to get into the debugger, the normal commands would be most
> helpful, especially if you can log the results:

It finally locked up, and ctrl-alt-esc got me into the debugger at
last! Is there anything else you want me to get whilst it is
like that, aside from:

> ps
> show lockedvnods
> show alllocks

which I can go and capture as screenshots? I can probably sort out console
access to it if that would be useful whilst it is in this
state.

-pete.

Pete French

Jan 16, 2009, 7:36:47 AM

> ps

Output from 'ps' is here: http://toybox.twisted.org.uk/~pete/71_lock_ps/
There are a lot of processes, as this machine runs the same webservices
as the actual webservers; it is just that nobody connects to them.

> show lockedvnods

nothing - there are no locked vnodes

> show alllocks

This gives me 'no such command'. There's a whole list of things I
can show, but none of them look like all the locks. What about the locktree
or the lockchain ?

Chagin Dmitry

Jan 16, 2009, 8:32:22 AM

On Fri, Jan 16, 2009 at 12:35:49PM +0000, Pete French wrote:
> > ps
>
> Output from 'ps' is here: http://toybox.twisted.org.uk/~pete/71_lock_ps/
> There are a lot of processes, as this machine runs the same webservices
> as the actual webservers; it is just that nobody connects to them.
>
> > show lockedvnods
>
> nothing - there are no locked vnodes
>
> > show alllocks
>
> This gives me 'no such command'. There's a whole list of things I
> can show, but none of them look like all the locks. What about the locktree
> or the lockchain ?
>

hi, please type:
show lock 0xffffff0001254d20
and then show thread 0xXXXXXXXXXXX where XXXXX is 'owner' of previous output.

--
Have fun!
chd

Pete French

Jan 16, 2009, 8:35:22 AM

> hi, please type:
> show lock 0xffffff0001254d20
> and then show thread 0xXXXXXXXXXXX where XXXXX is 'owner' of previous output.

http://toybox.twisted.org.uk/~pete/71_pdns_lock.png

That's in PowerDNS - which is interesting because the one difference
between the boxes that lock and those which don't is that the locking
ones are serving DNS.

-pete.

Chagin Dmitry

Jan 16, 2009, 8:48:43 AM

On Fri, Jan 16, 2009 at 01:34:14PM +0000, Pete French wrote:
> > hi, please type:
> > show lock 0xffffff0001254d20
> > and then show thread 0xXXXXXXXXXXX where XXXXX is 'owner' of previous output.
>
> http://toybox.twisted.org.uk/~pete/71_pdns_lock.png
>
> That's in PowerDNS - which is interesting because the one difference
> between the boxes that lock and those which don't is that the locking
> ones are serving DNS.
>

trace 832

--
Have fun!
chd

Robert Watson

Jan 16, 2009, 8:52:19 AM

On Fri, 16 Jan 2009, Pete French wrote:

>> hi, please type: show lock 0xffffff0001254d20 and then show thread
>> 0xXXXXXXXXXXX where XXXXX is 'owner' of previous output.
>
> http://toybox.twisted.org.uk/~pete/71_pdns_lock.png
>
> That's in PowerDNS - which is interesting because the one difference
> between the boxes that lock and those which don't is that the locking ones
> are serving DNS.

I rather feared as much. Let's run down the path of "perhaps there's a
problem with the new UDP locking code" for a bit and see where it takes us.
Is it possible to run those boxes with WITNESS -- I believe that the fact that
"show alllocks" is failing is because WITNESS isn't present. The other thing
we can do is revert UDP to using purely write locks -- the risk there is that
it might change the timing but not actually resolve the bug, so if we can
analyze it a bit using WITNESS first that would be useful.

Robert N M Watson
Computer Laboratory
University of Cambridge

Pete French

Jan 16, 2009, 9:10:26 AM

> trace 832

http://toybox.twisted.org.uk/~pete/71_trace_832_1.png
http://toybox.twisted.org.uk/~pete/71_trace_832_2.png

-pete.

Pete French

Jan 16, 2009, 9:15:03 AM

> I rather feared as much. Let's run down the path of "perhaps there's a
> problem with the new UDP locking code" for a bit and see where it takes us.
> Is it possible to run those boxes with WITNESS -- I believe that the fact that
> "show alllocks" is failing is because WITNESS isn't present.

Yes, I can do that. The only reason I wasn't running with WITNESS is that
it didn't lock up when I added the BREAK_TO_DEBUGGER, so I was seeing whether
a simple GENERIC kernel would lock up when I added that. I will go
back and add WITNESS when you tell me there's nothing more
we can get out of this lockup (recompiling will involve restarting the
machine, so I lose the 'broken to debugger' state). Should I add
anything else ? Skip spinlocks ? Invariants ?

> The other thing we can do is revert UDP to using purely write locks -- the
> risk there is that it might change the timing but not actually resolve the
> bug, so if we can analyze it a bit using WITNESS first that would be useful.

Yes, I will run with WITNESS and anything else you might want. Is there
anything else you, or anyone else, wants from this kernel ? It may take
another day to lock up when I've restarted it unfortunately.

Robert Watson

Jan 16, 2009, 9:17:58 AM

On Fri, 16 Jan 2009, Pete French wrote:

>> I rather feared as much. Let's run down the path of "perhaps there's a
>> problem with the new UDP locking code" for a bit and see where it takes us.
>> Is it possible to run those boxes with WITNESS -- I believe that the fact
>> that "show alllocks" is failing is because WITNESS isn't present.
>
> Yes, I can do that. The only reason I wasn't running with WITNESS is that it
> didn't lock up when I added the BREAK_TO_DEBUGGER so I was seeing if a
> simple GENERIC kernel would lock up when I added that. I will go back and
> add WITNESS when you tell me there's nothing more we can get out of this lock
> up (recompiling will involve restarting the machine so I lose the 'broken to
> debugger' state). Should I add anything else ? Skip spinlocks ? Invariants ?
>
>> The other thing we can do is revert UDP to using purely write locks -- the
>> risk there is that it might change the timing but not actually resolve the
>> bug, so if we can analyze it a bit using WITNESS first that would be
>> useful.
>
> Yes, I will run with WITNESS and anything else you might want. Is there
> anything else you, or anyone else, wants from this kernel ? It may take
> another day to lock up when I've restarted it unfortunately.

If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good.
WITNESS does a number of things, including tracking (and being judgemental
about) lock order. One nice side effect of that tracking is that we keep
track of a lot more lock state explicitly, so DDB's "show alllocks", "show
locks", etc, commands can build on that. "show lockedvnods" works without
WITNESS, though, so your results so far suggest this is likely not related to
vnode locking.

Robert N M Watson
Computer Laboratory
University of Cambridge
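
The options named in this thread could be collected into a custom kernel configuration along the following lines. This is a sketch only: the file location and the ident DEBUG are assumptions (the ident is chosen to match the "DEBUG kernel" mentioned later), and INVARIANT_SUPPORT is added because INVARIANTS depends on it.

```
# Hypothetical custom kernel config, e.g. sys/amd64/conf/DEBUG
include         GENERIC
ident           DEBUG

options         KDB                 # kernel debugger framework
options         DDB                 # interactive debugger backend
options         BREAK_TO_DEBUGGER   # console break sequence enters the debugger
options         INVARIANTS          # extra run-time consistency checks
options         INVARIANT_SUPPORT   # required by INVARIANTS
options         WITNESS             # lock order verification and lock tracking
options         WITNESS_SKIPSPIN    # do not run WITNESS on spin mutexes
```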

Pete French

Jan 16, 2009, 10:42:19 AM

> If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good.
> WITNESS does a number of things, including tracking (and being judgemental
> about) lock order. One nice side effect of that tracking is that we keep
> track of a lot more lock state explicitly, so DDB's "show alllocks", "show
> locks", etc, commands can build on that. "show lockedvnods" works without
> WITNESS, though, so your results so far suggest this is likely not related to
> vnode locking.

Right, I've gone back to my DEBUG kernel, which has a lot of options in it,
including all the above. Luckily it locked up almost immediately, so
now I have it sitting at the debugger prompt. The output from 'show alllocks'
is here:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

Which of these are worth tracing ?

-pete.

Pete French

Jan 16, 2009, 12:39:25 PM

Just continuing to look at this with the help of Dmitry, and the
output from 'bt' is here:

http://toybox.twisted.org.uk/~pete/71_bt.png

The top bit of that is from my 'show alllocks', the full version
of which is here:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

-pete.