
'crafty' took ages to kill


Dr. David Kirkby

Dec 30, 2003, 10:19:54 AM
Sorry for the cross-posting to two quite different newsgroups, but it
seems appropriate in the circumstances.

'crafty' is a well known chess program that can run on UNIX machines.
I've just spent quite a while trying to 'kill' an unwanted 'crafty'
process running on my UNIX workstation, which I noticed had eaten up
some 61 hours of CPU time. Despite

% kill pid

then when that failed


% kill -9 pid

a couple of times, it would not die. Finally, after about five kill -9's,
the process died. Has anyone seen this before? I'm running a Sun Ultra
80 workstation, Solaris 9, 4 x 450 MHz CPUs, 4 GB RAM, crafty 19.7
built for multi-threaded operation.

What can stop a process responding to SIGKILL? Since there were two
copies of this process running (one intentional, one not intentional),
each configured to use 4 CPUs, the load average was about 8, but that
should not be excessive for a quad-processor machine. The machine did
not appear under any strain, and interactive performance was fine, so
I'm a bit puzzled why this should happen.

I've seen similar things before on a Sun and once a reboot was
required. I'm just not quite sure how it can occur.

As usual, my email address can be found at
http://atlc.sourceforge.net/contact.html
should anyone choose to email me, although the newsgroup is probably a
better place for a response.

Lyle Merdan

Dec 30, 2003, 12:13:02 PM
In comp.sys.sun.admin Dr. David Kirkby <see_my_signature_f...@hotmail.com> wrote:
: Sorry for the cross-posting to two quite different newsgroups, but it

: % kill pid

: then when that failed


: % kill -9 pid

If a kill -9 doesn't zap a process I usually try kill -XCPU, and that
will often do the trick.

Lyle

Barry Margolin

Dec 30, 2003, 2:44:19 PM
In article <c99d2c79.03123...@posting.google.com>,
see_my_signature_f...@hotmail.com (Dr. David Kirkby)
wrote:

> What can stop a process responding to SIGKILL ??

The only thing that usually does this is the process being hung in the
kernel, e.g. it's stuck trying to access a device that's not responding.

Usually it's a device like a disk or tape drive, although years ago
SunOS 4.x had a bug where if the user used Control-S to suspend terminal
output then the process would be unkillable until they resumed with
Control-Q.
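A quick way to check for this kind of kernel hang (a sketch; the exact
`ps` output columns differ a little between Solaris and other systems)
is to look at the process state and its kernel wait channel. The shell's
own PID (`$$`) is used below purely as a stand-in for the stuck process:

```shell
# Show process state (s) and the kernel function it is sleeping in
# (wchan). A process wedged in an uninterruptible kernel sleep cannot
# take SIGKILL until the driver call returns.
ps -o pid,s,wchan -p $$
```

On Solaris, `truss -p <pid>` or `pstack <pid>` can also show where a
wedged process is sitting.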

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA

Ian Fitchet

Dec 30, 2003, 2:50:08 PM
see_my_signature_f...@hotmail.com (Dr. David Kirkby) writes:

> What can stop a process responding to SIGKILL ??

Nothing can stop it responding to SIGKILL once the signal has been
delivered. A SIGKILL might well be queued for the process
(unblockable and unignorable as SIGKILL is) but cannot be delivered
because, in Solaris: a) the process might be doing I/O, or b) the
process might be being traced.

There are probably other vital and important reasons why the expected
behaviour hasn't happened yet.

And then there are bugs in the OS. Yes, even in Solaris.

Cheers,

Ian

Robert Hyatt

Dec 30, 2003, 9:48:04 PM
In rec.games.chess.computer Dr. David Kirkby <see_my_signature_f...@hotmail.com> wrote:
> Sorry for the cross-posting to two quite different newsgroups, but it
> seems appropriate in the circumstances.

> 'crafty' is a well known chess program that can run on UNIX machines.
> I've just spent quite a while trying to 'kill' an unwanted 'crafty'
> process running on my UNIX workstation, which I noticed had eaten up
> some 61 hours of CPU time. Despite

> % kill pid

> then when that failed


> % kill -9 pid

> a couple of times, it would not die. Finally after about 5 kill -9's
> the process died. Has anyone seen this before? I'm running a Sun Ultra
> 80 workstation, Solaris 9, 4 x 450 MHz CPUs, 4 GB RAM, crafty 19.7
> built for multi-threaded operation.

Most likely you had a large hash table setting. The original kill
command probably caused Crafty to crash, which will then write a .core
file. With a big hash, egtb cache, egtb decompression indices, you can
get a .core file that will choke a large mule...

Otherwise no idea as a process can _never_ ignore kill -9...


> What can stop a process responding to SIGKILL? Since there were two
> copies of this process running (one intentional, one not intentional),
> each configured to use 4 CPUs, the load average was about 8, but that
> should not be excessive for a quad-processor machine. The machine did
> not appear under any strain, and interactive performance was fine, so
> I'm a bit puzzled why this should happen.

> I've seen similar things before on a Sun and once a reboot was
> required. I'm just not quite sure how it can occur.

I don't see this on linux so I am not sure, but the most common problem
is the huge core file that can take forever to write, particularly if you
are using NFS for your directory.

> As usual, my email address can be found at
> http://atlc.sourceforge.net/contact.html
> should anyone choose to email me, although the newsgroup is probably a
> better place for a response.

--
Robert M. Hyatt, Ph.D. Computer and Information Sciences
hy...@uab.edu University of Alabama at Birmingham
(205) 934-2213 136A Campbell Hall
(205) 934-5473 FAX Birmingham, AL 35294-1170

Robert Hyatt

Dec 30, 2003, 9:49:50 PM
In rec.games.chess.computer Lyle Merdan <ly...@visi.com> wrote:

> If a kill -9 doesn't zap a process I usually try kill -XCPU, and that
> will often do the trick.

> Lyle

-XCPU is "softer" than -9 (SIGKILL). That just reports "CPU time
exceeded" but the process can still run a bit longer...
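The difference is easy to demonstrate in a shell one-liner (a sketch,
using SIGTERM as a stand-in for any catchable signal such as SIGXCPU,
since signal-name support in `trap` varies between shells):

```shell
# A catchable signal can be trapped, and the process carries on:
sh -c 'trap "echo caught" TERM; kill -TERM $$; echo still alive'
# prints: caught
#         still alive

# SIGKILL cannot be trapped or ignored; the shell dies immediately
# and the final echo is never reached:
sh -c 'echo before; kill -KILL $$; echo after'
# prints: before
```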

Dr. David Kirkby

Jan 2, 2004, 4:37:10 PM
Robert Hyatt <hy...@crafty.cis.uab.edu> wrote in message news:<bstdd4$5sm$1...@juniper.cis.uab.edu>...

> > 'crafty' is a well known chess program that can run on UNIX machines.
> > I've just spent quite a while trying to 'kill' an unwanted 'crafty'
> > process running on my UNIX workstation, which I noticed had eaten up
> > some 61 hours of CPU time. Despite
>
> > % kill pid
>
> > then when that failed
>
>
> > % kill -9 pid
>
> > a couple of times, it would not die. Finally after about 5 kill -9's
> > the process died. Has anyone seen this before? I'm running a Sun Ultra
> > 80 workstation, Solaris 9, 4 x 450 MHz CPUs, 4 GB RAM, crafty 19.7
> > built for multi-threaded operation.
>
> Most likely you had a large hash table setting. The original kill
> command probably caused Crafty to crash, which will then write a .core
> file. With a big hash, egtb cache, egtb decompression indices, you can
> get a .core file that will choke a large mule...
>
Yes, I did have large hash table settings - probably 500 MB or more for
hash and hashp. However, /etc/system contains the line:

set sys:coredumpsize 0

which should prevent coredumps being produced - I do that for
security reasons.

> Otherwise no idea as a process can _never_ ignore kill -9...

> I don't see this on linux so I am not sure, but the most common problem
> is the huge core file that can take forever to write, particularly if you
> are using NFS for your directory.

I don't normally use NFS for my home directory on Solaris, but on this
occasion I did. It's a long time since I've seen this behaviour under
Solaris (> 1 year), but I must admit I have seen it before.

Am I right in assuming crafty does not use any form of kernel locking
on Solaris, but just pthreads for SMP support?

Dr. David Kirkby
email address at: http://atlc.sourceforge.net/contact.html

Robert Hyatt

Jan 2, 2004, 4:58:35 PM
In rec.games.chess.computer Dr. David Kirkby <see_my_signature_f...@hotmail.com> wrote:

> set sys:coredumpsize 0

Yes. It does use the pthread_mutex_lock() mutex calls, of course, for
critical sections in the SMP code. No idea what else might cause a
slow termination...


> Dr. David Kirkby
> email address at: http://atlc.sourceforge.net/contact.html


Anders Thulin

Jan 3, 2004, 2:37:45 AM
Dr. David Kirkby wrote:

> Yes I did have large hash table settings - probably 500 Mb or more for
> hash and hasp. However, /etc/system contains the line:
>
> set sys:coredumpsize 0
>
> which should prevent coredumps being produced - I do that for
> securetiy reasons.

Sorry -- not chess oriented, but perhaps of use for the other group:

It won't do you any good: coredumpsize is not a valid kernel variable,
and there has not been any kernel module called sys since at least
Solaris 2. So it won't do anything. (Yes, there are Sun documents that
claim it should be used, but they are wrong.)

Use coreadm instead.

See http://dbforums.com/arch/128/2002/7/426204 for my sources.

--
Anders Thulin a...@algonet.se http://www.algonet.se/~ath

Casper H.S. Dik

Jan 3, 2004, 6:04:13 AM
see_my_signature_f...@hotmail.com (Dr. David Kirkby) writes:

>set sys:coredumpsize 0

>which should prevent coredumps being produced - I do that for
>securetiy reasons.

That line in /etc/system has no effect; and it never had any effect
either.

I know that it has been documented in some documents, even some originating
from Sun; but it never was a Solaris tunable.

The reason why the tunable gives no error either is fairly simple: as
long as the module "sys" isn't loaded, the kernel doesn't know that
there's no "sys'coredumpsize" tunable. Since there's no module "sys",
the error is never detected. (There never was a Solaris module called
"sys" either; I've checked all source code back to Solaris 2.0, so I
*know* that this is a statement of fact.)

I suppose that computers have progressed to the point that they're
sufficiently magic to warrant prayer-like command sequences and
configuration options, but I digress.

You will need to use "coreadm" to limit core dumps or redirect them
on sufficiently recent Solaris versions (Solaris 7 with kernel patch rev
106541-06 (sparc) or 106542-06 (intel), or any later Solaris release).
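For the record, a sketch of the relevant coreadm invocations (Solaris
only; the -d/-e/-g forms need root, and the /var/cores path below is
just an example location, not a system default):

```shell
# Display the current core-file configuration:
coreadm

# Disable per-process core dumps system-wide (requires root):
coreadm -d process

# Or keep cores, but collect them centrally with informative names
# (%f expands to the executable name, %p to the PID):
coreadm -g /var/cores/core.%f.%p -e global
```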

>I don't normally use NFS for my home directory on Solaris, but on this
>occasion I did. It's a long time since I've seen this behaviour under
>Solaris (> 1 year), but I must admit I have seen it before.

Coredumps could still be an issue considering the above.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Dr. David Kirkby

Jan 3, 2004, 9:17:57 PM
Casper H.S. Dik <Caspe...@Sun.COM> wrote in message news:<3ff6a1ad$0$329$e4fe...@news.xs4all.nl>...

> see_my_signature_f...@hotmail.com (Dr. David Kirkby) writes:
>
> >set sys:coredumpsize 0
>
> >which should prevent coredumps being produced - I do that for
> >securetiy reasons.
>
> That line in /etc/system has no effect; and it never had any effect
> either.
>
> I know that it has been documented in some documents, even some originating
> from Sun; but it never was a Solaris tunable.
<snip>

> Coredumps could still be an issue considering the above.
>
> Casper

Casper,

I do have:
limit coredumpsize 0M
in my .cshrc file.
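(For Bourne-family shells the equivalent of that csh `limit` line is
`ulimit -c` - a minimal sketch:)

```shell
# Forbid core files for this shell and everything it spawns:
ulimit -c 0

# Verify the limit; prints 0:
ulimit -c
```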

coreadm shows:

sparrow /export/home/davek % coreadm
global core file pattern:
init core file pattern: core
global core dumps: disabled
per-process core dumps: enabled
global setid core dumps: disabled
per-process setid core dumps: disabled
global core dump logging: disabled

(I've never used coreadm, so I guess they are the system defaults).

I've not seen any coredumps produced, from either crafty or any other application.

Dr. David Kirkby.

email at:
http://atlc.sourceforge.net/contact.html

Casper H.S. Dik

Jan 4, 2004, 9:39:19 AM
see_my_signature_f...@hotmail.com (Dr. David Kirkby) writes:

>I do have:
>limit coredumpsize 0M
>in my .cshrc file.

That should work.

>coreadm shows:

>sparrow /export/home/davek % coreadm
> global core file pattern:
> init core file pattern: core
> global core dumps: disabled
> per-process core dumps: enabled
> global setid core dumps: disabled
> per-process setid core dumps: disabled
> global core dump logging: disabled

>(I've never used coreadm, so I guess they are the system defaults).

They are.

>I've not seen any coredumps produced, from either crafty or other application.


Then perhaps it's something that happens when tearing down a large
address space. (Probably lots of crosscalls and such)
