
Hair Restorer

Jonathan Sizz

Jul 18, 2006, 4:47:48 PM
I refer to SCO's redacted whinge - 724-the-mothership, not
724-A-the-shuttlecraft. Bottom of page 44. Footnote 12.

I take it that "hair restorer" is a Very Silly Nickname for a Very Silly
Bugfix of doing four writes to some gun-totin' register. If so, SCO is
now actually claiming, get this, IBM's Sekrit Knowlege of i386 APIC bugs.

*boggle*

Anyone know any more about this?

Jonathan Sizz

Jul 18, 2006, 5:11:14 PM
Okay, following up on myself here.

Read this and then tell me how, even on Planet Zzbzz, this is a contractually
controlled method or concept??

http://www.ussg.iu.edu/hypermail/linux/kernel/0302.3/0955.html

---->8-------->8-------->8-------->8-------->8-------->8-------->8----

From: James Cleverdon (jame...@us.ibm.com)
Date: Wed Feb 26 2003 - 19:32:09 EST

On Wednesday 26 February 2003 08:52 am, Martin J. Bligh wrote:
[ Snip! ]
> >
> > Anyway, the above is clearly not what we're doing with the ESR right now.
> >
> > Martin: in the esr disable case you clearly write the ESR multiple times
> > ("over the head with a big hammer"), and you must do that because you
> > noticed that a single write was insufficient. Why four? Did you just
> > decide that as long as you're doing multiple writes, you might as well
> > just do "several". Or did four writes work and two didn't?
>
> The latter, IIRC, 2 writes worked most of the time, but never really fixed
> it. Using any kind of logical analysis never seemed to work on that chip
> ... brute force, trial and error, and 3 months of tearing my hair out was
> the only thing that succeeded in the end. A time I have no wish to revisit
> ;-)
>
> cc'ed James Cleverdon ... he was involved in this with PTX, and gave me
> some pointers to hair-restorer during the Linux timeframe.
>
> M.
> -

You want _that_ story, eh? 8^)

* * * * *

Yeah we had ESR problems on the original NUMA-Q boxes with P6 CPUs. On system
shutdown, CPU 0 on one or more secondary nodes would occasionally spasm with
an infinite stream of APIC error interrupts claiming invalid message. A
couple hardware guys and I spent a lot of time looking at the APIC bus with
special APIC bus analyzers, etc. We _never_ caught a malformed message on
the APIC bus.

Once a CPU started weirding out like this, it was impossible to make it shut
up. We could clear the error status, and it would show cleared in the ESR,
but the local APIC would reissue the same error interrupt as soon as we
returned from the error handler.

In fact, with kernel printf turned off we would get about a million of them
per second, faster than most APIC messages could be sent over the APIC bus.
(This was a 16.6667 MHz two bit wide bus. Messages were about 10 to 40
frames long.)

Thus, I concluded that it was some weird error state in the local APIC. We
never got any answer back from Intel on how to clear this state, let alone
admission that it existed, so we just turned off the APIC error IRQ. Since
we were shutting down the system anyway, this seemed an adequate kludge.

Writing 0 to the ESR four times was done out of paranoia, and a desire to
grind the clear deeper into the local APIC's state machine. I have no
evidence that it ever really fixed this bug. Nothing did.

Maybe this weirdness was fixed in P2s or later CPUs. Maybe. Intel never did
say anything about it to us. Regardless, the four writes to ESR is still
enshrined in Dynix/PTX's APIC error handler, and will remain a hidden
testimony to this bug for as long as IBM maintains PTX support.

--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com

---->8-------->8-------->8-------->8-------->8-------->8-------->8----
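
For what it's worth, the bus figures in the quoted message can be sanity-checked with a little arithmetic. The sketch below assumes that one "frame" occupies one APIC bus clock cycle, which the message does not actually say; the function and constant names are made up for illustration.

```python
# Rough upper bound on APIC bus message rate, using figures from the
# quoted message. ASSUMPTION (not stated in the source): one "frame"
# takes exactly one bus clock cycle.

BUS_CLOCK_HZ = 16.6667e6        # bus clock quoted in the message
OBSERVED_IRQS_PER_SEC = 1e6     # "about a million of them per second"


def max_messages_per_second(frames_per_message):
    """Most messages the bus could carry per second at this length."""
    return BUS_CLOCK_HZ / frames_per_message


def break_even_frames(rate_per_second):
    """Longest message (in frames) that could still arrive at this rate."""
    return BUS_CLOCK_HZ / rate_per_second


if __name__ == "__main__":
    for frames in (10, 40):     # message length range quoted in the message
        rate = max_messages_per_second(frames)
        print(f"{frames} frames: at most {rate:,.0f} messages/s")
    limit = break_even_frames(OBSERVED_IRQS_PER_SEC)
    print(f"a million per second needs messages under {limit:.1f} frames")
```

Under that assumption, only messages near the short end of the 10 to 40 frame range could arrive a million times per second, which fits the conclusion that the error interrupts were not coming in over the bus.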

infosecgroupie

Jul 18, 2006, 5:15:11 PM

Refer to http://blog.gmane.org/gmane.linux.kernel/day=20030227

In full:

Re: [BUG] 2.5.63: ESR killed my box!

On Wednesday 26 February 2003 08:52 am, Martin J. Bligh wrote:
[ Snip! ]
> >
> > Anyway, the above is clearly not what we're doing with the ESR right now.
> >
> > Martin: in the esr disable case you clearly write the ESR multiple times
> > ("over the head with a big hammer"), and you must do that because you
> > noticed that a single write was insufficient. Why four? Did you just
> > decide that as long as you're doing multiple writes, you might as well
> > just do "several". Or did four writes work and two didn't?
>
> The latter, IIRC, 2 writes worked most of the time, but never really fixed
> it. Using any kind of logical analysis never seemed to work on that chip
> ... brute force, trial and error, and 3 months of tearing my hair out was
> the only thing that succeeded in the end. A time I have no wish to revisit
> ;-)
>
> cc'ed James Cleverdon ... he was involved in this with PTX, and gave me
> some pointers to hair-restorer during the Linux timeframe.
>
> M.
> -

You want _that_ story, eh? 8^)

So the reference to "hair restorer" backreferences to "tearing my hair out".


- i_s_g
--
infosecgroupie
http://www.finchhaven.com/TSCOG/



Jonathan Sizz

Jul 18, 2006, 5:23:26 PM
> So the reference to "hair restorer" backreferences to "tearing my hair out".
> - i_s_g

Incredibly, even though our messages crossed, I'd missed that detail -- thanks!

infosecgroupie

Jul 18, 2006, 5:40:25 PM

Here's the original full text from the LKML:

http://lkml.org/lkml/2003/2/26/272

From James Cleverdon
Subject Re: [BUG] 2.5.63: ESR killed my box!
Date Wed, 26 Feb 2003 16:32:09 -0800

On Wednesday 26 February 2003 08:52 am, Martin J. Bligh wrote:
[ Snip! ]
> >
> > Anyway, the above is clearly not what we're doing with the ESR right now.
> >
> > Martin: in the esr disable case you clearly write the ESR multiple times
> > ("over the head with a big hammer"), and you must do that because you
> > noticed that a single write was insufficient. Why four? Did you just
> > decide that as long as you're doing multiple writes, you might as well
> > just do "several". Or did four writes work and two didn't?
>
> The latter, IIRC, 2 writes worked most of the time, but never really fixed
> it. Using any kind of logical analysis never seemed to work on that chip
> ... brute force, trial and error, and 3 months of tearing my hair out was
> the only thing that succeeded in the end. A time I have no wish to revisit
> ;-)
>
> cc'ed James Cleverdon ... he was involved in this with PTX, and gave me
> some pointers to hair-restorer during the Linux timeframe.
>
> M.
> -

You want _that_ story, eh? 8^)

*****
