memset() bug which caused MPFR test failures will be fixed by Sun.

Dr. David Kirkby

Jul 23, 2009, 8:36:15 PM
to mp...@loria.fr, sage-...@googlegroups.com
Many people looked at the reason there were 20 test failures of the MPFR
test suite on a Sun T5240. I believe the issue is due to memset().

I telephoned Sun a couple of days back to report this officially. They
have been extremely efficient at handling this case.

I now have some information from Sun. I asked the engineer if I could
make it public, and he has said yes. He is in fact going to put some of
it on a mailing list, and it will eventually appear in Sunsolve.

I'm told the fix will be backported to Solaris 10 and I should have an
Interim Diagnostic Relief to test myself in a few weeks, but it won't be
a public patch for some time, until it's fully tested.

Dave

----
Your service case regarding memset(3C)'s behaviour on sun4v systems
when the size_t argument is nonzero but zero mod 232 has now been
transferred to Europe, and I have taken ownership. And I intend to
keep ownership until we've reached a mutually acceptable resolution,
barring vacation stand-ins and unforeseen events.

Let me quickly recapitulate the facts:

+ This is on record as a bug under Change Request id 6507249,
+ it's fixed in the internal development version of (future)
  Solaris and thus in OpenSolaris based on builds snv_62 or later,
+ it affects only
  + 32-bit applications
  + running on Solaris 10
  + on all SPARC sun4v (CoolThreads(TM)) platforms,
+ it originates in the hardware-optimized libc_psr_hwcap[12].so.1
  which (by default) get mounted over /platform/sun4v/lib/libc_psr.so.1
  during the Solaris boot sequence,
+ it affects invocations of memset(3C) where the third (size_t)
  argument is nonzero but its low-order 32 bits are zero (thus
  it ought to be zero considered as a size_t).

(A subtle point is that it won't affect the *first* call to memset()
after exec, as the runtime loader processing for lazy symbol binding
will clear the upper 32 bits as a side effect before passing control
to the newly-bound function entry point.)
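To make the trigger condition concrete, here is a minimal sketch in C. It only models the arithmetic described above and does not reproduce the libc_psr code. The helper names size_register_value() and low32_zero_but_nonzero() are hypothetical, as is the assumption that a 64-bit intermediate product is what leaves stray upper bits in the register.

```c
#include <stdint.h>

/* Hypothetical helper: models a size computed with 64-bit arithmetic,
 * e.g. an element count multiplied by an element size. */
uint64_t size_register_value(uint32_t nelem, uint32_t elem_size)
{
    return (uint64_t)nelem * (uint64_t)elem_size;
}

/* Hypothetical helper: returns 1 when the 64-bit value is nonzero but
 * its low-order 32 bits are zero, i.e. when it would be zero if viewed
 * as a 32-bit size_t. This is the condition described for the bug. */
int low32_zero_but_nonzero(uint64_t reg)
{
    return reg != 0 && (uint32_t)reg == 0;
}
```

For example, 65536 elements of 65536 bytes each give a product of exactly 2^32, which satisfies the condition, whereas a product of 2^32 + 8 does not.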

The bugfix has not (yet) been backported to Solaris 10 because there
has not (yet) been any tangible demand for such a backport. Until
just now, there had not yet been a single external customer record
on CR#6507249!

I am adding one for this present service request now.

In fact, the vast majority of application code would not be at risk
of being affected by this bug. Most uses of memset() pass a compile-
time constant for the size, often some sizeof(struct such_and_such).
Passing a manifest 32-bit int variable for the size will also avoid
the bug. It can only happen when memset() is invoked with some
nontrivial arithmetic expression, or some explicit 64-bit variable
for the size. Such code idioms are quite rare.

Also, there are a number of workarounds to choose from, depending on
the situation:

+ When the application source code is available for modification:
  + store the expression result in a variable and then pass the
    variable to memset() (though compiler optimizations might
    subvert this),
  + test the variable for being 32-bit-equal-to-zero and bypass
    memset() if it is,

+ or at runtime:
  + invoke the application with LD_NOAUXFLTR=1 in the environment
    (cf. man ld.so.1(1), which selectively disables the optimized
    libc_psr.so.1 just for this process),
  + umount the optimized libc_psr.so.1 system-wide,
  + interpose a different memset() implementation e.g. via an
    LD_PRELOAD'ed shared object.
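The source-level workaround of testing the size and bypassing memset() could look roughly like the sketch below. The wrapper name guarded_memset() is hypothetical, and this is not the actual change made to any particular library. It assumes a 32-bit build, where size_t is 32 bits wide, so the comparison is meant to look only at the low-order 32 bits; whether a given compiler really emits a 32-bit compare is also an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: store the size in a variable first, then skip
 * the memset() call entirely when that size is zero, so the optimized
 * routine is never entered with a zero-length request. */
void *guarded_memset(void *dst, int c, size_t n)
{
    size_t len = n;   /* size held in a plain variable, not an expression */

    if (len == 0)     /* in a 32-bit build this tests a 32-bit value */
        return dst;   /* nothing to clear; bypass memset() */

    return memset(dst, c, len);
}
```

Callers would then replace memset(p, 0, expr) with guarded_memset(p, 0, expr) at the sites where the size comes from nontrivial arithmetic.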

> Since the MPFR library code we are using is open source, we have managed
> to work around this Solaris bug, by sticking an 'if' statement in front
> of the call to the macro which calls memset().
>
> Though of course I don't know if it will affect anything else. So I
> guess it is safer to unmount this, but I assume that will have quite a
> performance impact.

The performance impact of not using the optimized libc_psr.so.1
varies widely among applications, depending on how much memset()ing
and memcpy()ing and memmove()ing they do. It can range all the way
from negligible to a few tens of percent in benchmarks.

But the LD_NOAUXFLTR=1 approach limits the performance impact to
those applications which are known or suspected to be affected
by the bug.

<SNIP>

Would you be willing to test-drive any future binary fix in the
shape of an Interim Diagnostic Relief prior to patch creation,
as well as a release-candidate patch at the T-Patch stage prior
to patch release? For background information on IDRs, please see:

http://sunsolve.sun.com/show.do?target=IDR

Since the affected deliverables (libc_psr_hwcap1.so.1) have also
been modified by existing patches including some Kernel Update
patches, any such IDR would (have to) be built to fit onto a
particular set of patch revisions. The most recent change had
come in Kernel Update patch 127127-11, thus the easiest would
be an IDR with a hard dependency on this patch. Should you have
need for an IDR against older patch levels than this, please do
let me know!

William Stein

Jul 23, 2009, 8:42:47 PM
to sage-...@googlegroups.com, mp...@loria.fr
On Thu, Jul 23, 2009 at 5:36 PM, Dr. David
Kirkby<david....@onetel.net> wrote:
>
> Many people looked at the reason there were 20 test failures of the MPFR
> test suite on a Sun T5240. I believe the issue is due to memset().
>
> I telephoned Sun a couple of days back to report this officially. They
> have been extremely efficient at handling this case.
>
> I now have some information from Sun. I asked the engineer if I could
> make it public, and he has said yes. He is in fact going to put some of
> it on a mailing list, and it will eventually appear in Sunsolve.
>
> I'm told the fix will be backported to Solaris 10 and I should have an
> Interim Diagnostic Relief to test myself in a few weeks, but it won't be
> a public patch for some time, until it's fully tested.
>
> Dave

Wow, Dave, you are amazingly good at doing things right and being a
professional! Thanks!!

William
--
William Stein
Associate Professor of Mathematics
University of Washington
http://wstein.org

Dr. David Kirkby

Jul 23, 2009, 9:26:32 PM
to sage-...@googlegroups.com, mp...@loria.fr
William Stein wrote:
> On Thu, Jul 23, 2009 at 5:36 PM, Dr. David
> Kirkby<david....@onetel.net> wrote:
>> Many people looked at the reason there were 20 test failures of the MPFR
>> test suite on a Sun T5240. I believe the issue is due to memset().
>>
>> I telephoned Sun a couple of days back to report this officially. They
>> have been extremely efficient at handling this case.
>>
>> I now have some information from Sun. I asked the engineer if I could
>> make it public, and he has said yes. He is in fact going to put some of
>> it on a mailing list, and it will eventually appear in Sunsolve.
>>
>> I'm told the fix will be backported to Solaris 10 and I should have an
>> Interim Diagnostic Relief to test myself in a few weeks, but it won't be
>> a public patch for some time, until it's fully tested.
>>
>> Dave
>
> Wow, Dave, you are amazingly good at doing things right and being a
> professional! Thanks!!
>
> William

Cheers.

>> Your service case regarding memset(3C)'s behaviour on sun4v systems
>> when the size_t argument is nonzero but zero mod 232 has now been
>> transferred to Europe, and I have taken ownership. And I intend to
>> keep ownership until we've reached a mutually acceptable resolution,
>> barring vacation stand-ins and unforeseen events.

That was supposed to be 2 to the power 32, but my 2^32 appears to come
out as the number 232 there.
