The Flaw: Enough Already, Please...

Cam Lee

Dec 26, 1994, 2:20:48 PM
Hasn't this subject been beaten to death? Is there anything more that
can be said that hasn't already been said? Whether you agree or disagree
with how Intel has handled this issue, the fact of the matter is they've
attempted to rectify the situation by offering a replacement for the
life of the PC.

Whether your views are good, bad or indifferent on the subject, 6 months
to a year from now this subject will be all but forgotten. So let's get
on with technical discussions about topics that will broaden our
horizons.

Cam

Henry G. Baker

Dec 26, 1994, 5:01:10 PM
In article <3dn52g$o...@newsbf02.news.aol.com> cam...@aol.com (Cam Lee) writes:
>Whether your views are good, bad or indifferent on the subject, 6 months
>to a year
>from now this subject will be all but forgotten.
^^^^^^^^^^^^^^^^^

I don't think so. Six months from now, people will still be
'decontaminating' their Pentium-calculated data, including spreadsheet
data that they calculated on a Pentium and stored on a file server
(and some backup tapes). People will still be paying extra on their
mortgages whose payments were calculated on a Pentium, etc.

No, the fallout from the Pentium disaster is likely to be worse than
Chernobyl (in the U.S., at least), since most of the contaminated data
is here.

M. L.

Dec 27, 1994, 9:43:39 AM
> No, the fallout from the Pentium disaster is likely to be worse than
> Chernobyl (in the U.S., at least), since most of the contaminated data
> is here.
>

This is without a doubt the most idiotic statement found in the
sea of stupidity flowing through this news group.

Henry Baker

Dec 27, 1994, 11:48:06 AM
In article <3dp96r$h...@mhaal.inhouse.compuserve.com>, "M. L."
<ml...@csi.compuserve.com> wrote:

Perhaps you've never attempted to do any serious floating point work.

I once worked at a research hospital where we did multiple regression
analyses that went on for days. We were attempting to separate out
the influences from different variables in very noisy data.

A problem like the Pentium bug is the worst possible problem. It isn't
large enough to be 'obviously' wrong, but you have to re-do _all_ of
your calculations.

Furthermore, if you have several different types of computers, you
now have the additional problem of trying to remember which computations
were run on which computer. If you recall, one of the major benefits
of the IEEE floating point standard was so that numeric runs on different
machines could finally be compared.

This is a major nightmare. Chernobyl didn't cause anything like this
in any U.S. research laboratory that I know of.

Phil Payne

Dec 28, 1994, 5:48:13 AM
In article <hbakerD1...@netcom.com>
hba...@netcom.com "Henry G. Baker" writes:

> In article <3dn52g$o...@newsbf02.news.aol.com> cam...@aol.com (Cam Lee) writes:
> >Whether your views are good, bad or indifferent on the subject, 6 months
> >to a year
> >from now this subject will be all but forgotten.
> ^^^^^^^^^^^^^^^^^
>
> I don't think so. Six months from now, people will still be
> 'decontaminating' their Pentium-calculated data, including spreadsheet
> data that they calculated on a Pentium and stored on a file server
> (and some backup tapes).

It has been pointed out by others that optimising compilers will recognise
expressions with only constants as input and compile the result of evaluating
the expression, not the expression itself. Thus if you multiply something by
2/3rds, the compiled code will multiply by .666667.

If an "at risk" expression is compiled, the resulting object programme will
produce Pentium errors even when run on a 486 - and until the code is
recompiled on a "good" Pentium.
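
As a concrete illustration (whether any particular compiler actually
folds this is implementation dependent), the quotient below involves
only constants, so a compiler that folds constant expressions may
compute it while compiling and bake the result into the object code:

#include <stdio.h>

int main(void)
{
    /* 4195835/3145727 is the widely cited Pentium-bad operand pair;
     * a quotient folded on a flawed Pentium would carry the FDIV error
     * into every machine that later runs this object code. */
    double scale = 4195835.0 / 3145727.0;
    printf("%.15g\n", scale * 3145727.0);   /* ideally prints 4195835 */
    return 0;
}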

Phil Payne
Sievers Consulting UK
Vice Chair, UK Computer Measurement Group
Phone +44 385 302803
Fax/BBS +44 1536 723021
Fido 2:2503/415
CIS 100012,1660

Vaughan R. Pratt

Dec 28, 1994, 1:06:55 PM
In article <3ds7gl$d...@Radon.Stanford.EDU>,
Vaughan R. Pratt <pr...@Sunburn.Stanford.EDU> wrote:
>It would be truly remarkable if there were a single program in the
>world today which has any likelihood of being cross-compiled for
>serious use on other platforms and which contains such an at-risk
>expression and which was not written with the Pentium bug in mind.

...where "other platforms" is not intended to preclude the 486 and
clean Pentiums.
--
Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html

Jeff Miller

Dec 28, 1994, 10:42:54 AM
Ph...@sievers.demon.co.uk (Phil Payne) writes:

>It has been pointed out by others that optimising compilers will recognise
>expressions with only constants as input and compile the result of evaluating
>the expression, not the expression itself. Thus if you multiply something by
>2/3rds, the compiled code will multiply by .666667.
>
>If an "at risk" expression is compiled, the resulting object programme will
>produce Pentium errors even when run on a 486 - and until the code is
>recompiled on a "good" Pentium.

As a technical matter, this is true of course.

As a practical matter, any programmer who would release object code today,
compiled on a Pentium, containing floating point constants, is a complete
incompetent dolt.

Jeff Miller
jxm...@borg.mnet.uswest.com
"Performing random acts of
moral ambiguity."

Vaughan R. Pratt

Dec 28, 1994, 12:33:09 PM
In article <788611...@sievers.demon.co.uk>,
Phil Payne <Ph...@sievers.demon.co.uk> wrote:
>
>If an "at risk" expression is compiled, the resulting object programme will
>produce Pentium errors even when run on a 486 - and until the code is
>recompiled on a "good" Pentium.

It would be truly remarkable if there were a single program in the
world today which has any likelihood of being cross-compiled for
serious use on other platforms and which contains such an at-risk
expression and which was not written with the Pentium bug in mind. The
expression would have to contain a constant term x/y where the constant
terms x and y evaluate to a bad pair. Such terms are not easily come
by in the first place, and they are very implausible candidates for
constants appearing in a program sufficiently significant that it will
ever be cross-compiled for serious use elsewhere.

The probability of being bitten by bad *data* from a Pentium is many
orders of magnitude greater than by bad *programs* not written with
malice aforethought. Worry in proportion to your risk, or go grey
prematurely.

Vaughan R. Pratt

Dec 30, 1994, 1:26:58 AM
Summary: We give an empirically determined upper bound on the time it
would take a chain reaction in a critical mass of Pentiums to melt down
a database, together with the program used to determine it.

In article <hbaker-2712...@192.0.2.1>,
Henry Baker <hba...@netcom.com> wrote:
>In article <3dp96r$h...@mhaal.inhouse.compuserve.com>, "M. L."
><ml...@csi.compuserve.com> wrote:
>> > No, the fallout from the Pentium disaster is likely to be worse than
>> > Chernobyl (in the U.S., at least), since most of the contaminated data
>> > is here.
>> This is without a doubt the most idiotic statement found in the
>> sea of stupidity flowing through this news group.
>Perhaps you've never attempted to do any serious floating point work.
>I once worked at a research hospital where we did multiple regression
>analyses that went on for days. We were attempting to separate out
>the influences from different variables in very noisy data.
>A problem like the Pentium bug is the worst possible problem. It isn't
>large enough to be 'obviously' wrong, but you have to re-do _all_ of
>your calculations.

In the physical sense it is absurd to compare the Pentium with
Chernobyl. I've been sitting beside mine for weeks; had it been
remotely comparable to Chernobyl I wouldn't be able to lift a finger to
type.

Of course that's not what Henry had in mind; he meant a suitable
analogy in the data processing world. What I found myself wondering
was, just how broad does this analogy have to be in order to be apt?
On investigating this I found that the analogy can be taken far more
literally than even M. Lane presumably had in mind.

Chernobyl happened when its operators lost control of the reactor's
chain reaction while conducting an ill-considered experiment. When the
world's Pentiums ship each other flawed data, a natural question to ask
is whether the data processing equivalent of a sustainable and
uncontrolled chain reaction is possible. Continuing the analogy, is
the Pentium's equivalent of a nuclear half-life (the time between
errors) sufficiently short, and the decay products (the error
magnitudes) sufficiently strong that a chain reaction in a sufficiently
dense mass of Pentiums (relative to the number of participating
non-Pentiums) could both start in a reasonable time and sustain
itself?

Now we have seen many examples that trigger the bug with a probability
greater than the official one in nine billion. The most questionable
examples are those based on specific quotients such as 4195835/3145727
and 4.999999/14.999999, or somewhat more generic computations such as
x/(d.d-d.d); many have wondered just how often these examples are
really encountered in everyday computation, while many others are
convinced that they are quite likely enough to present a real risk.
Also questionable are the examples that embed some understanding of the
nonuniformities in the distribution of the bad operands, e.g. my
small-bruised-integer measurements. Just how at-risk really is a
typical program that is ignorant of those nonuniformities?

With these concerns in mind, I designed the following "lab experiment,"
with a Pentium as the lab. The design of the experiment is not at all
specific to the Pentium, and is simple enough both to implement in a
reasonable time and to at least partially understand. Its
purpose is to allow one to observe the rate at which errors appear, and
to study their propagation.

The upshot was that errors appeared about a thousand times sooner than
the random model predicts, and they propagated energetically enough to
sustain a chain reaction essentially analogous to those taking place in
atomic bombs and nuclear reactors.

The experiment simulated a computational world of indeterminate scope,
one that could correspond either to a distributed database run by many
computers or a centralized one run by one computer. This database's
knowledge is obtained partly from "axioms" or "sensory input", i.e.
clean data from outside the scope of the experiment, and partly by
"deduction," i.e. computation performed within its scope. The latter
is where the Pentium-induced damage arises.

I had originally considered taking the "sensory input" to consist of
financial amounts such as $9.98. However the more complicated the
source the more suspect it becomes in terms of reflecting a possible
bias towards Pentium-bad numbers. So in the end I settled for a
constant incoming stream of 1's, on the theory that if the number 1 all
by itself is Pentium-bad, things are much worse than we thought!

"Deduction" was simply the inference of new numbers from previously
obtained ones via arithmetic operations chosen at random. All
divisions were by flawed Pentium, corresponding to all computers
participating in the maintenance of this database being such. I did
not investigate in this experiment what lesser density of Pentiums
might still constitute a critical mass for a chain reaction.

In more detail, I first set up a numeric database consisting of a
single number, namely 1. I then set a flawed Pentium (not mine, it's
been fixed) to work adding new numbers to this database. These numbers
came from two sources. The first source was the database itself; two
numbers were selected at random from the database, then combined by an
operation randomly selected from among the four operations of addition,
subtraction, multiplication, and division, then stored in the
database. The second source was a stream of 1's, constituting "fresh
data" moderating any chain reaction that might start up like a graphite
rod in a reactor core. These sources were weighted so that the three
groups plus-minus, times-divide, and the ones received one third of the
draws each. This
was achieved simply by viewing these as six sources: four for the
arithmetic operations and two for the stream of 1's treated as a pair
of such streams. These six sources were sampled randomly, more
precisely pseudorandomly with a uniform distribution using the Unix
library integer rand() function.
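
A quick sanity check of that weighting, separate from the experiment
proper: mapping rand()%6 so that 0-1 select plus-minus, 2-3 select
times-divide, and 4-5 select the 1's, each group comes out with about
a third of the draws.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long count[3] = {0, 0, 0};
    long i, n = 6000000;
    for (i = 0; i < n; i++)
        count[(rand() % 6) / 2]++;  /* 0: +/-, 1: * and /, 2: the 1's */
    printf("plus-minus %ld  times-divide %ld  ones %ld\n",
           count[0], count[1], count[2]);
    return 0;
}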

I did not want the database to fill up with improbably large or tiny
numbers, whose absurd sizes might raise eyebrows; on the other hand I
also did not want an artificial cliff in the distribution, whose
precise location might raise questions. I therefore wrote a macro
ablg(x) to compute (int)(fabs(log_2(x)/32)) fast and used it in
((rnd=(rnd%15)+1)%(ablg(cl)+ablg(di)+1) == 0) as a fast stochastic
filter of high-magnitude-exponent numbers whose effect is first felt on
positive numbers lying outside [2^-31,2^32] (and symmetrically for
negative numbers). This filter rolls off gently, with probability 1/3
of retaining a number in [2^32,2^96] (and symmetrically in
[2^-95,2^-31]), down to 1/15 for [2^416,2^480] (in general 1/(2n+1) for
[2^{32(2n-1)},2^{32(2n+1)}] for n=1,...,7), and probability 0
thereafter (via the %15).

After the database fills up, the program creates space for each new
number by overwriting a randomly selected old number. Although I have
32Mb, I set the storage limit to 750,000 double-precision numbers
(.75*8*2 = 12 Mbytes, there being two copies of the database) so that
the program would run on 16Mb machines without swapping (the accesses
are highly nonlocal and would cause thrashing if swapping happened).

To measure errors I ran two such databases, clean and dirty, in
lockstep, performing the same operations on both using the same operand
addresses and making all decisions identically for both (e.g the above
stochastic filter resolves disagreement between the clean and dirty
numbers by in effect taking their geometric mean). The only difference
is that clean division uses the Mathworks' Matlab solution: scale
numerator and denominator by 15/16 when the divisor is at risk, as in
case 3 of reactor.c below. (For this
sort of work a dirty Pentium is much faster than a clean one because it
is much faster to emulate clean division on a dirty Pentium than vice
versa.)

Before giving the results of this experiment it is worth considering
what might plausibly have been expected. Because no effort was made to
produce Pentium-sensitive numbers of the kind familiar from the IBM
report and my small-bruised-integer response functions (bug1 and bug2
in /pub/FDIV/individual.bugs on boole.stanford.edu), one might expect
only one out of every 9 billion divisions, and hence one out of every
6*9 = 54 billion numbers produced, would be in error. One would
therefore expect to wait about four days for the first error, and only
then would the interesting question arise as to whether that error
would have time to clone itself sufficiently before it and its first
few clones were all overwritten by the steady inbound flow of clean
data. Even if the error survived, its expected size would be very
small: only one in every 40 or so errors is in the 5th decimal digit,
and hence occurs only every 160 days at full speed (or every 40*27,000
~ one million years at the 1000 divisions per day rate).

In fact what happened was that my database effectively "melted down"
altogether in 280 "cycles" taking 32 minutes, with 20% of its 750,000
numbers having a relative error more than 0.1 and more than 6% having a
relative error more than 1 (i.e. meaningless). Although a few more
cycles would have seen yet further deterioration, my experiment was
programmed to stop at that point: enough is enough!

A "cycle" is defined here as the arrival of just enough numbers to fill
the database, less a few filtered-out oversize numbers. On an isolated
Pentium a cycle thus consists of 750,000 consecutive operations
(including 1's), most of which put a number into the database. My P90
gets through a cycle in 7 seconds; were it not for other issues like
locality and propagation delays a network of a thousand Pentiums would
accomplish a cycle in 7 milliseconds. Meltdown took 32 minutes on my
machine, but if the same operations were to be performed in parallel it
could occur in seconds in a sufficiently large and fast network of
Pentiums. (Ordinarily a sequential machine can create long sequential
dependencies that would invalidate this reasoning by creating long
error propagation paths that parallel computers would not have time to
develop, but the random access pattern of this reactor creates
dependency chains of an expected length sufficiently low as to be
consistent with those arising in much more parallel computation; short
dependency chains are usually an essential component of being
"embarrassingly parallelizable.")

Here is the printout detailing the meltdown. The vertical axis is
time, increasing downwards linearly; each line is printed after 20
cycles (so line 9 comes at the end of cycle 180). The horizontal axis
is space: each of the 50 or so dots on a line denotes 2% of the
database, the whole space being ordered left to right by decreasing
relative error, with hexadecimal digits denoting precision boundaries:
6 for 10^-6, c for 10^-12, etc.

1 0123456789a.b.c.d................................................
2 0123456789.a.b.c.d................................................
3 012345.6.7.8.9.a.b.c.d...............................................
4 012345.6.7.8.9.a.b...c...d..........................................
5 012345.6.7.8.9..a....b.......c....d.................................
6 0123.4.5.6.7.8...9.....a......b.....c..d............................
7 012.3.4.5.6.7...8........9......a....b..c.d...........................
8 012.3.4.5..6....7.......8.......9...a..b.c.d...........................
9 .0.1.2.3.4..5......6.......7.....8...9.a.b.c.d..........................
10 .0.1.2.3..4.....5.........6....7..8.9.a.b.c.d.........................
11 .0.1.2...3......4.......5.....6..7.8.9.a.b.c.d.........................
12 .0.1...2........3......4....5..6.7.8.9.a.b.c.d........................
13 .0....1......2........3...4..5.6.7.8.9.a.b.c.d........................
14 ....0........1......2....3..4.5.6.7.8.9.a.b.c.d........................

The dots immediately following a hex digit or at the start of the line
actually represent anywhere from a single entry to 2% of the database
(whence the slight line-to-line variations in number of dots), the
other dots represent exactly 2%. On line 6, the three dots between 8
and 9 indicate that between 4% and 6% of the database has relative
errors between 10^-8 and 10^-9, while the single dot between 3 and 4 on
line 6 means that there is at least one error in the range 10^-3 down
to 10^-4. Absence of a dot, e.g. between 2 and 3 on line 6, means no
errors at all in the range 10^-2 to 10^-3. Dots to the right of d
denote errors less than 10^-13, including correct results. Because 1/3
of the incoming data consists of 1's, which are automatically correct,
this region does not shrink below 1/3 of the whole database, which is
why d slows down instead of going all the way across the page.

The boundaries start out squished together in the upper left. Each
boundary takes turns rushing across the no-man's land in the middle, to
join its comrades huddling together in the lower right. In terms of
error populations this means that the correctness of the database
deteriorates explosively; errors blow up by an order of magnitude every
twenty cycles or so. The experiment stops when more than 2% of the
data has unit relative error (is meaningless); here it had reached 6-8%
by the time this stopping condition was examined, at which point at
least 20% of the data base had relative errors above 0.1. Had it not
been stopped the error explosion would have continued at the same speed
and the data base would in a few dozen more cycles be a mixture of 1's
and completely meaningless numbers, total meltdown.

Caveats. There are several important caveats in interpreting these
results. The most basic caveat is that any chain reaction, whether
physical or computational, is highly sensitive to the prevailing
conditions. Thus small changes to various parameters may increase or
decrease both the speed of onset of the reaction and its ability to
sustain itself. In particular I would not expect it would take very
much of a dilution, in the form of either non-Pentiums or clean
Pentiums contributing to the process, or a different mix of incoming
data and operations performed on the data, to moderate this reaction to
below the level at which it can sustain itself. Thus few if any
real-world computing environments would be likely to provide the right
conditions for this chain reaction. With the large number and variety
of today's environments, however, the possibility that some might be at
some risk of "going critical" should be kept in mind.

Offsetting these reassuring considerations is the less reassuring
possibility that the incoming data may include either many small dollar
amounts, shown by the IBM work to be at well above average risk on the
Pentium, or my explicitly bruised integers, e.g. rationals as decimals
truncated by mandate, or data read from a decimal calculator showing
4.999999 or from an analog-to-digital device that is positioned almost
on an integer. When the stream of 1's is replaced by such
Pentium-sensitive data the onset of a chain reaction could conceivably
be much faster, and/or the critical mass of Pentiums (as measured by
the rate of flawed divisions) required to sustain it could conceivably
be much lower.

All these variations almost certainly will have some effect. How much
awaits further experimental investigation, as does the question of
which variations have any likelihood of arising in real computing
environments.

As an independent application, by avoiding Pentium-specific knowledge
the experiment might make a useful general-purpose FPU test when left
to run for a week, provided its "clean" source is obtained from a
reference machine.

Here is the program, which may be freely distributed and modified
without my permission. It runs under Linux but should require little
or no modification for other OS's (on a Pentium of course). It would
ease my mind a bit if when you hack it up you put your name, date, and
hack at the top, in reverse chronological order, so I get neither the
credit nor the blame due you.

==========clip here=== reactor.c ======

/* Reactor simulator; generates chain reaction using Pentium
* division errors as decay products.
* Please log changes/author/date here in reverse chronological order
* Author: V. Pratt, 12/29/94, pr...@cs.stanford.edu
*/
#include <stdlib.h>
#include <math.h>
#include <stdio.h>

#define MAXW 750000 /* steady-state population (should fit in 16Mb) */
#define MAXE 14 /* report error distribution down to 10^-14 */
#define PR 20 /* # cycles between error distribution reports */
#define ablg(x) abs(16-(0x7c&(2+(((char*)&(x))[7])))/4)
/* fast approx. to (int)(fabs(log_2(x)/32)) */

int tick, cycle, W, grp[MAXE+1];
double clean[MAXW], dirty[MAXW];

main(argc, argv)
char *argv[];
{
    int opd1, opd2, opr, ans, rnd, relcnt, watch;
    int i, j, e;
    double cl, di, rel, tem;

    watch = rnd = relcnt = 0;
    if (argc == 2)          /* any argument -> watch errors that */
        watch = 1;          /* are large relative to recent examples */
    rel = 1e-15;            /* initial setting for "relatively large" */
    /* initialize world with one element, 1.0 */
    clean[0] = dirty[0] = 1.0, W = 1;
    for (cycle = 1; grp[0] < MAXW/50; cycle++) {
        for (tick = 0; tick < MAXW; tick++) {
            /* Choose random expression "opd1 opr opd2" */
            opd1 = rand()%W;
            opd2 = rand()%W;
            opr = rand()%6;
            /* Evaluate expression */
            switch (opr) {
            case 0:             /* add operands */
                cl = clean[opd1] + clean[opd2];
                di = dirty[opd1] + dirty[opd2];
                break;
            case 1:             /* subtract operands */
                cl = clean[opd1] - clean[opd2];
                di = dirty[opd1] - dirty[opd2];
                break;
            case 2:             /* multiply operands */
                cl = clean[opd1] * clean[opd2];
                di = dirty[opd1] * dirty[opd2];
                break;
            case 3:             /* divide operands */
                cl = at_risk(clean[opd2])?
                        ((15./16)*clean[opd1])/
                        ((15./16)*clean[opd2]):
                        clean[opd1]/clean[opd2];
                di = dirty[opd1] / dirty[opd2];
                break;
            default:            /* output 1.0 */
                cl = di = 1.0;
                break;
            }
            /* reduce probability of values far from 1 */
            if ((rnd=(rnd%15)+1)%(ablg(cl)+ablg(di)+1) == 0) {
                /* result passed, store it */
                ans = W<MAXW? W++: rand()%MAXW;
                clean[ans] = cl;
                dirty[ans] = di;
            }
            /* if watching, show relatively large discrepancies */
            if (watch && fabs(cl-di) > rel*fabs(cl)) {
                printf("%4d:%-8d%18.14g %18.14g %7.2g\n",
                       cycle, tick, cl, di, (cl-di)/cl);
                if (((relcnt++)%10) == 0)
                    rel *= 2;
            }
        }
        /* Every PR cycles, print distribution of errors */
        if ((cycle%PR) == 0) {
            printf("%4d ", cycle/PR);
            for (i = 0; i <= MAXE; i++)
                grp[i] = 0;
            for (i = 0; i < W; i++) {
                tem = (clean[i]-dirty[i])/clean[i];
                e = .301*(1026-(((short*)&tem)[3]&0x7ff0)/16);
                e = e < 0? 0: e > MAXE? MAXE: e;
                grp[e]++;
            }
            for (i = 0; i <= MAXE; i++) {
                for (j = 0; j < grp[i]; j += W/48+1)
                    printf(".");
                printf(i<MAXE? "%x": "\n", i);
            }
            fflush(stdout);     /* needed when stdout is a file!  Bug? */
        }
    }
}

at_risk(double x)
{
    return (((char*)(&x))[5]&0xfc)==0xfc &&
           ((0x2492>>(((short*)(&x))[3]&0xf))&1);
}

===== end of reactor.c ==========

Paul Rubin

Dec 30, 1994, 10:48:34 PM
In article <3e097i$9...@radon.stanford.edu>,
Vaughan R. Pratt <pr...@Sunburn.Stanford.EDU> wrote:
>In fact what happened was that my database effectively "melted down"
>altogether in 280 "cycles" taking 32 minutes, with 20% of its 750,000
>numbers having a relative error more than 0.1 and more than 6% having a
>relative error more than 1 (i.e. meaningless). Although a few more
>cycles would have seen yet further deterioration, my experiment was
>programmed to stop at that point: enough is enough!

Is this an interesting result? It sounds like your program computes
the evolution of what must be a chaotic system (your database) (is
*that* an interesting result?). You've shown that after a certain
amount of time the dirty-Pentium results differ substantially
from correct IEEE results, but the IEEE results might *also*
be totally numerically meaningless.

What happens if you do the same calculation with extended precision
(i.e. storing all numbers as 80-bit floats) and comparing the results
with "clean" double (64-bit) precision? If you have a big enough
machine, how about doing the same thing in multiple precision (128
bits or more) and comparing the results with clean 64- or 80-bit
arithmetic?

By the way, if you're using gcc, you should use -ffloat-store on your
compilation command line to prevent double precision calculations from
being done in a mixture of 64- and 80-bit arithmetic.
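
For example, assuming gcc and the usual math library, something along
the lines of

    gcc -O2 -ffloat-store -o reactor reactor.c -lm

so that double variables are kept at their declared (64-bit) precision
instead of carrying extra bits around in the 80-bit registers.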

Vaughan R. Pratt

Jan 1, 1995, 4:59:24 AM
In article <phrD1n...@netcom.com>, Paul Rubin <p...@netcom.com> wrote:
>In article <3e097i$9...@radon.stanford.edu>,
>Vaughan R. Pratt <pr...@Sunburn.Stanford.EDU> wrote:
>>In fact what happened was that my database effectively "melted down"
>>altogether in 280 "cycles" taking 32 minutes, with 20% of its 750,000
>
>Is this an interesting result? It sounds like your program computes
>the evolution of what must be a chaotic system (your database) (is
>*that* an interesting result?). You've shown that after a certain
>amount of time the dirty-Pentium results differ substantially
>from correct IEEE results, but the IEEE results might *also*
>be totally numerically meaningless.

Oops. Indeed. It's not terribly interesting to know that A took n
cycles to melt down relative to B if both had long before then melted
down relative to the truth.

>What happens if you do the same calculation with extended precision
>(i.e. storing all numbers as 80-bit floats) and comparing the results
>with "clean" double (64-bit) precision?

Sigh. It turns out that if you take this hit on every instruction (in
contrast to the hit from the FDIV bug, which though much bigger in
magnitude is much smaller in rate, about one in every few million
instructions in this setup as it turns out) then the meltdown of double
precision relative to extended precision takes 160 cycles, in contrast
to 280 cycles for the Pentium error relative to double precision.

I had put off doing this experiment because I had been expecting this
difference to be so small as to take far longer to propagate, but in
practice it is not. In hindsight this was a big mistake.

Databases don't seem to melt down relative to the world, not even on
account of using single precision arithmetic. Why they don't is an
excellent question. I thought about this for a bit and decided that a
basic difference between my database and a real one is that mine is
untyped: it lets any two things be combined in any way. Although we
aren't counting apples and oranges, one could imagine that each of the
1's streaming in counted one kumquat, the type Brian Reid would use
here. Now it is fine to add 1 kumquat plus 1 kumquat to get 2
kumquats. And it is also fine to multiply them, in which case you get
one square kumquat. What is bad is to add kumquats to square
kumquats. But it is ok to multiply them, getting cubic kumquats. And
you can divide a kumquat by a kumquat to get a pure number, call this
type *unit*. If you now divide 3 cubic kumquats into 12 units, you get
4 units per cubic kumquat. And so on.

So I converted my untyped reactor into a typed one, somewhat
arbitrarily set limits of plus and minus 3 (cubic and cubic^-1) for the
types, and only permitted correctly typed operations. This seemed to
bring the *size* of numbers under some control; they didn't seem to
want to fluctuate so much. It also postponed the meltdown *a little*,
getting to 200 instead of 160. But this was still much sooner than
with the Pentium numbers, which got to 280 cycles before reaching the
corresponding point.

So what other differences might there be between my database and a real
one? Well, a number that has been obtained by a long chain of
calculations is at great risk of accumulating a lot of errors along the
way. I therefore decided to not trust such numbers and to omit them
from the database. I defined the depth of each incoming 1 as 0, and
the depth of each calculated number as 1 plus the maximum depth of its
operands, and arbitrarily set 100 as the maximum permissible depth.

This had the following dramatic effect (see my original message on
chain reactions for the meaning of this printout).

1 012345.6.7.8.9.a.b.c.d................................................
2 012345.6.7.8.9.a.b..c...d............................................
3 0123456.7.8.9.a.b.c.d................................................
4 01234567.8.9.a.b.c.d................................................
5 0123456.7.8.9.a.b.c.d................................................
6 01234567.8.9.a.b.c.d................................................
7 0123456789.a.b.c.d................................................
8 01234567.8.9.a.b.c.d................................................
9 012345678.9.a.b.c.d................................................
10 012345678.9.a.b.c.d................................................
11 01234567.8.9.a.b.c.d................................................
12 0123456789.a.b.c.d................................................
13 01234567.89.a.b.c.d................................................
14 0123456789a.b.c.d................................................
15 0123456789a.b.c.d................................................
16 012345678.9.a.b.c.d................................................
17 012345678.9.a.b.c.d................................................
18 012345678.9.a.b.c.d................................................
19 01234567.89.a.b.c.d................................................

For the first 20 cycles (line 1) the database started to melt down as
before, with errors reaching above 10^-6. But at cycle 40 (line 2)
things had not gotten any worse, and after that the errors that had
initially rushed down to 10^-5 started to get cleaned out by the
incoming clean 1's. By cycle 140 (line 7) all errors greater than 10^-9
were gone. Thereafter occasional errors would creep in, all less than
10^-7, but sometimes there would be no errors bigger than 10^-10, as at
cycles 280 and 300.

So avoiding long dependency chains seems like an excellent strategy for
avoiding database meltdown. Types in contrast merely slowed down the
onset of meltdown by about 20%. I've always been a bit atheistic about
types (10 years at typeless MIT can do that to one :), and I've also
long had a deep mistrust of long proofs (I've never met a long proof I
understood). These quantitative observations about meltdown provide
some kind of confirmation of both of my intuitions.

Now, back to the Pentium bug. With the reference database now acting
more stably, it becomes a much more meaningful exercise to compare
clean and Pentium-flawed division with respect to the rate at which the
latter can melt things down. I chose to run both databases at double
precision, assuming that the outcome would not be very different from
running both at extended precision (= long double) (I don't know how
true this is).

Well, the outcome was mixed. On the one hand, an error appeared on
average once every million divisions, at least for the first 200
cycles. Judging by their behavior (they tended to come in bunches
within a few cycles, with gaps of dozens of error-free cycles in
between) I would estimate that a third of these were FDIV errors and
the other two thirds inherited those errors by arithmetic before being
wiped out by the stream of 1's. This corresponds to 1 FDIV error in 3
million divisions, which is 3,000 times the 1 in 9 billion rate
assuming random data. This is significant given that, although the
incoming data was not at all random, being all 1's, the operands and
operations used in obtaining new values from old were completely
random. So even though no Pentium-specific knowledge at all was built
into this model, erroneous operands were found by the model at a
relatively high rate. The errors covered the spectrum fairly uniformly
(by exponent) from 10^-6 down to 10^-13.

But then I ran the experiment to 400 cycles and no further errors
appeared, diluting all the numbers of the preceding paragraph by a
factor of 2. But then from cycles 402 to 420, 160 errors appeared! So
the number of errors can fluctuate wildly. But this particular big
bunch of errors all stayed within the range 10^-10 to 10^-14, as such
pretty inconsequential.

On the other hand all the above FDIV-induced errors were swept up by
the replacement process much more quickly than they were generated. In
contrast, the constant stream of truncation errors at double precision
as measured relative to extended precision came faster than they could
be swept up, maintaining a population of errors that extended down to
10^-10 with occasional excursions down to 10^-8 after stabilizing, as
we saw above.

I then did another experiment, changing the source of numbers from
constant 1 to random numbers of the form d.d, e.g. 6.7 or 0.2. These
seem particularly bad in the context x/(d.d-d.d), and so might be
expected to give more errors. And in fact cycles 11-36 showed 120
errors; had we not seen the 160 errors in cycles 402-420 with 1's as
inputs, we might conclude that d.d was much worse than 1's. But then
in cycles 37-200, the only errors were a batch of 9 errors in 74-78 and
one error in cycle 139. The enormous variation within experiments thus
seems to dominate any possible variation one might suspect between
these two experiments. Hence for this reactor program at least one
cannot infer much of an important difference between the sources 1 and
d.d.

My interpretation of all this is that, from a very pragmatic point of
view, Intel could afford to weaken its 1 in 9 billion claim to 1 in a
few million, while at the same time pointing to the distribution of
errors, which are known to occur uniformly across the spectrum of
leading bit 14 down to leading bit 52, as greatly diluting the apparent
seriousness of this thousand times higher rate. In particular, for at
least this random model of data replenishment, such errors propagate
only a little, averaging few offspring in their brief lifetimes, though
with occasional big bursts.

The obvious caveat here is that my reactor program doesn't claim to
represent faithfully *any* real world application, which could perform
either better or worse than reactor.c with regard to its response to
the Pentium FDIV error. For example a nonlinear least-squares fitting
program could reasonably be expected to be far more sensitive to such
errors than this naive model of database maintenance.

Nevertheless it seems to me a very interesting data point serving to
calibrate "high" error rates such as one in a few million (as opposed
to 1 in 9 billion) as to the seriousness of their ultimate impact on
relatively insensitive programs such as this one, namely not very
serious. Despite its artificial and simplistic nature, my program may
be a sufficiently indicative model of what to expect from a "typical"
application, for those times when you *have* to have some expectation
and you have no a priori reason to expect circumstances sufficiently
different as to indicate a substantially different outcome.

I cannot recommend such an expectation for anything that puts either
people or large sums of money at any risk. But for many relatively low
risk applications this sort of reasoning may be a perfectly adequate
justification for continuing to use a buggy Pentium.

Again I enclose the program. Please bring any bugs in it to my
attention, in case they are responsible for any erroneous conclusions.
Remove the comments around the two instances of "long," and simplify
clean division by eliminating the at_risk clause, if you want to do the
double-vs-extended experiment on any system (not necessarily an x86)
that distinguishes long double from double. (On a big-endian system
definitely remove at_risk since it assumes little-endian order.)

=======clip here====reactor.c========

/* Reactor simulator; generates chain reaction using Pentium
 * division errors as decay products.
 * Author: V. Pratt, 12/29/94, pr...@cs.stanford.edu
 */
#include <stdlib.h>
#include <math.h>
#include <stdio.h>

#define MAXW 750000     /* steady-state population (should fit in 16Mb) */
#define MAXE 14         /* report error distribution down to 10^-14 */
#define PR 20           /* # cycles between error distribution reports */

#define TYPLIM 3        /* no types higher than cubic kq or per cubic kq */
#define MAXDEP 100      /* no dependency chains longer than 100 */

#define ablg(x) abs(16-(0x7c&(2+(((char*)&(x))[7])))/4)
                        /* fast approx. to (int)(fabs(log_2(x)/32)) */

int tick, cycle, W, grp[MAXE+1];

/* long */ double clean[MAXW];
double dirty[MAXW];
char typ[MAXW];
int depth[MAXW];

main(argc, argv)
char *argv[];
{
    int opd1, opd2, opr, ans, rnd, relcnt, watch;
    int i, j, e, ty, de;
    /* long */ double cl;
    double di, cld, rel, tem;

    watch = rnd = relcnt = 0;
    if (argc == 2)          /* any argument -> watch errors that */
        watch = 1;          /* are large relative to recent examples */
    rel = 1e-16;            /* initial setting for "relatively large" */
    /* initialize world with one element, 1.0 */
    clean[0] = dirty[0] = typ[0] = W = 1;
    depth[0] = 0;
    for (cycle = 1; grp[0] < MAXW/50; cycle++) {
        for (tick = 0; tick < MAXW; tick++) {
            /* Choose random expression "opd1 opr opd2" */
            opd1 = rand()%W;
            while (depth[opd1] >= MAXDEP && ++opd1 < W);
            if (opd1 == W)
                continue;
            opd2 = rand()%W;
            opr = rand()%6;
            ty = typ[opd1];
            i = -1;
            /* require correct types and bounded depth */
            switch (opr) {
            case 0: case 1:
                while ((typ[opd2] != ty ||
                        depth[opd2] >= MAXDEP) && ++opd2 < W);
                break;
            case 2: i = 1;
            case 3: do { ty = typ[opd1] + i*typ[opd2];
                    } while ((ty<-TYPLIM || TYPLIM<ty ||
                              depth[opd2] >= MAXDEP) && ++opd2<W);
                break;
            default:
                ty = 1;
                de = 0;
                break;
            }
            if (opd2 == W)
                continue;
            if (opr < 4)
                de = 1+(depth[opd1]>depth[opd2]?
                        depth[opd1]: depth[opd2]);
            /* Evaluate expression */
            cl = di = 0;
            switch (opr) {
            case 0:             /* add operands */
                cl = clean[opd1] + clean[opd2];
                di = dirty[opd1] + dirty[opd2];
                break;
            case 1:             /* subtract operands */
                cl = clean[opd1] - clean[opd2];
                di = dirty[opd1] - dirty[opd2];
                break;
            case 2:             /* multiply operands */
                cl = clean[opd1] * clean[opd2];
                di = dirty[opd1] * dirty[opd2];
                break;
            case 3:             /* divide operands */
                cl = at_risk(clean[opd2])?
                        ((15./16)*clean[opd1])/
                        ((15./16)*clean[opd2]):
                        clean[opd1]/clean[opd2];
                di = dirty[opd1] / dirty[opd2];
                break;
            default:            /* output 1.0 */
                cl = di = 1.0;
                break;
            }
            cld = cl;
            /* reduce probability of values far from 1
            if ((rnd=(rnd%15)+1)%(ablg(cld)+ablg(di)+1) == 0) { */
            if (!(ablg(cld) || ablg(di))) {
                /* result passed, store it */
                ans = W<MAXW? W++: rand()%MAXW;
                clean[ans] = cl;
                dirty[ans] = di;
                typ[ans] = ty;
                depth[ans] = de;
                /* if watching show discrepancies */
                if (watch && fabs(cld-di) > rel*fabs(cld)) {
                    printf("%4d:%-8d%2d %18.14g %18.14g %7.2g%12d\n",
                           cycle, tick, ty, cld,
                           di, (cld-di)/cld, depth[ans]);
                    fflush(stdout);
                    if (((relcnt++)%10) == 0)
                        rel *= 2;
                }
            }
        }
        /* Every PR cycles, print distribution of errors */
        if ((cycle%PR) == 0) {
            printf("%4d ", cycle/PR);
            for (i = 0; i <= MAXE; i++)
                grp[i] = 0;
            for (i = 0; i < W; i++) {
                tem = (clean[i]-dirty[i])/clean[i];
                e = .301*(1026-(((short*)&tem)[3]&0x7ff0)/16);
                e = e < 0? 0: e > MAXE? MAXE: e;
                grp[e]++;
            }
            for (i = 0; i <= MAXE; i++) {
                for (j = 0; j < grp[i]; j += W/48+1)
                    printf(".");
                printf(i<MAXE? "%x": "\n", i);
            }
            fflush(stdout);
        }
    }
}

at_risk(double x)
{
    return (((char*)(&x))[5]&0xfc)==0xfc &&
           ((0x2492>>(((short*)(&x))[3]&0xf))&1);
}

=========end of reactor.c==========
--
Vaughan Pratt FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html

Anil Das

Jan 1, 1995, 6:03:26 AM
In article <phrD1n...@netcom.com>, p...@netcom.com (Paul Rubin) writes:
> What happens if you do the same calculation with extended precision
> (i.e. storing all numbers as 80-bit floats) and comparing the results
> with "clean" double (64-bit) precision? If you have a big enough
> machine, how about doing the same thing in multiple precision (128
> bits or more) and comparing the results with clean 64- or 80-bit
> arithmetic?

I think Paul raised the most important question, but:

My understanding of Professor Pratt's experiment
is that one cycle is 750000 operations, so 280 cycles
is 21E7 operations in total;
the length of the chain of operations giving any one entry
in the database would be much less.

With (correct) double precision math the only way
21E7 operations can give relative errors more than 1
(or even 1E-8) is for a number of subtractions of nearly equal
quantities to occur. Maybe the experiment should be rerun after
removing the subtract operation?
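
(One minimal way to try that in reactor.c, just a sketch: redraw the
operation whenever subtraction comes up, e.g.

    do {
        opr = rand()%6;
    } while (opr == 1);     /* never subtract */

which leaves the relative weights of the remaining sources unchanged.)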

I also have two minor questions for the
famous empirical cyber entomologist of the group.

a) Has any attempt been made to run the experiment
with a different random number generator or the same random
number generator with a different seed? That will show how
sensitive the results are to initial conditions. The more
sensitive, the more chaotic the whole thing is.

b) Is a count of the actual number of erroneous divisions
that occurred available? case 3: could be re-written like:

            di = dirty[opd1] / dirty[opd2];
            if (at_risk(clean[opd2])) {
                cl = ((15./16)*clean[opd1]) / ((15./16)*clean[opd2]) ;
                if (cl != di && dirty[opd1] == clean[opd1]
                             && dirty[opd2] == clean[opd2])
                    bad_division_count++ ;
            }
            else {
                cl = clean[opd1]/clean[opd2];
            }
            break ;

to count the number of `significant' bad divisions. (This will
not count the bad divisions that occur when the operands are
already corrupted).

I don't have a buggy Pentium, or I would have run the experiments
myself.

--
Anil Das an...@nskernel.tandem.com

Vaughan R. Pratt

Jan 1, 1995, 3:58:06 PM
In article <D1q41...@dsg.tandem.com>,
Anil Das <an...@nskernel.tandem.com> wrote:
>
> With (correct) double precision math the only way
>21E7 operations can give relative errors more than 1
>(or even 1E-8) is for a number of subtractions of nearly equal
>quantities to occur. Maybe the experiment should be rerun after
>removing the subtract operation?

While ten million is a theoretical upper bound on chain lengths, the
observed upper bound was roughly 200,000 times less (idle conjecture:
750,000/e less?). This was still enough to let the considerable
heating effect of subtraction do its work, to my great surprise not to
mention mortification (*so* embarrassing...).

I'm not sure how meaningful the results are without subtraction, but I
tried it anyway for both situations, double vs. extended precision
and Pentium divide vs. correct. For the former the (many) errors got
no worse than 5*10^-14 at 280 cycles, i.e. not much error propagation.
For the latter, in 240 cycles there were 459 errors, i.e. not many and
they kept being swept up within a few cycles. Most were under 10^-10,
but there were five relatively large ones: 1.5e-08, 3.4e-06, 6e-08,
4.3e-08, and 6e-08.

Other experiments suggest themselves here, such as greatly diluting the
1's to see what happens when their mopping-up or moderating influence
is removed; those few but large errors mentioned just above might make
a much bigger nuisance of themselves then. However a more meaningful
experiment than simply removing subtraction might be to allow it but
put some limit on how small the difference can get relative to the
operands, e.g. .001 or .5. This could be taken as a crude model of the
improved stability obtained for ill-conditioned data by careful
pivoting, Householder triangulations, etc. I would guess that the .001
limit would not make a big difference but that .5 would not be terribly
different from no subtraction at all with regard to rate of meltdown.
Too many fun experiments possible, too little time.
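
(For the record, a minimal and untried sketch of that limit: in
reactor.c's case 1 one could reject any subtraction whose clean result
is smaller than LIMIT times the larger operand,

    case 1:     /* subtract operands, rejecting heavy cancellation */
        cl = clean[opd1] - clean[opd2];
        di = dirty[opd1] - dirty[opd2];
        tem = fabs(clean[opd1]) > fabs(clean[opd2])?
              fabs(clean[opd1]): fabs(clean[opd2]);
        if (fabs(cl) < LIMIT*tem)
            continue;       /* skip this tick entirely */
        break;

with LIMIT a new #define set to .001 or .5.)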

> a) Has any attempt been made to run the experiment
>with a different random number generator or the same random
>number generator with a different seed? That will show how
>sensitive the results are to initial conditions. The more
>sensitive, the more chaotic the whole thing is.

I ran quite a few experiments with other parameters being varied, which
achieves the same end. In general the setup is quite chaotic as I
indicated under "Caveats" in my original message. I did in fact try
varying only the seed early on, just to be sure, and saw the same
chaotic variations as when varying other parameters.

> b) Is a count of the actual number of erroneous divisions
>that occured available?
>

> di = dirty[opd1] / dirty[opd2];
> if (at_risk(clean[opd2])) {
> cl = ((15./16)*clean[opd1]) / ((15./16)*clean[opd2]) ;
> if (cl != di && dirty[opd1] == clean[opd1]
> && dirty[opd2] == clean[opd2])
> bad_division_count++ ;

No, something like this would be a good addition, to distinguish
original from inherited FDIV errors. But I would prefer to count all
FDIV bugs, not just those on clean data as per the above, since it is
possible for the bug to "feed on itself." Tim Coe showed me a very
nice example of this, which I've expressed here in C:

main()
{
    int i;
    double x = 9.2, y = 19.4;
    x -= 4.2, y -= 4.4;
    for (i = 0; i < 4; i++) {
        printf("want %20.15f\n got %20.15f %g\n",
               (1.01*x)/(1.01*y), x/y, (x - (x/y)*y)/x);
        x = (x/y)*y;
    }
}

This starts out with x = 9.2 - 4.2, y = 19.4 - 4.4, and repeatedly
divides x by y and then multiplies the y back in. Here's how the
quotient and relative error look after doing this up to 4 times---it
stabilizes at the 4th step at a relative error of 3*10^-6.

want 0.333333333333333
got 0.333333333332363 2.91038e-12
want 0.333333333332363
got 0.333333329358720 1.19209e-08
want 0.333333329358720
got 0.333332312106116 3.05176e-06
want 0.333332312106116
got 0.333332312106116 0

Your term "cyberentomologist" has a catchy ring to it.

Michael Schmidt

Jan 2, 1995, 4:43:00 PM
In article <3e5uds$9...@Radon.Stanford.EDU>
pr...@Sunburn.Stanford.EDU (Vaughan R. Pratt)
wrote:

...some stuff deleted...


> along the way. I therefore decided to not trust such numbers and
> to omit them from the database. I defined the depth of each
> incoming 1 as 0, and the depth of each calculated number as 1 plus
> the maximum depth of its operands, and arbitrarily set 100 as the
> maximum permissible depth.
>
> This had the following dramatic effect (see my original message on
> chain reactions for the meaning of this printout).
>
> 1 012345.6.7.8.9.a.b.c.d......................................

...some stuff deleted...

I think it would be well worth it (I can imagine enough others would
appreciate it too) if you could repost your whole original message,
where the meanings are explained, to the comp.sys.intel newsgroup.

------------------\
Michael Schmidt | msch...@me-tech.pfm-mainz.de : Email
ME Technology | ++49 +6741 7126 : Fax
QNX Products \----------- SCSI and Connectivity
## CrossPoint v3.02 ##

Vaughan R. Pratt

Jan 2, 1995, 7:24:29 PM
In article <5d5eE...@me-tech.PFM-Mainz.DE>,
Michael Schmidt <msch...@me-tech.PFM-Mainz.DE> wrote:
> In article <3e5uds$9...@Radon.Stanford.EDU>
> pr...@Sunburn.Stanford.EDU (Vaughan R. Pratt)
>>
>> This had the following dramatic effect (see my original message on
>> chain reactions for the meaning of this printout).
>>
>> 1 012345.6.7.8.9.a.b.c.d......................................
>...some stuff deleted...
>
>I think it's important enough and would be fine (I can imagine
>enough others would appreciate it too) if you could repost your
>whole original message, where the meanings are explained, to the
>comp.sys.intel newsgroup.

Let me just do it from scratch. Anyone who wants the original message
can get it by anonymous ftp as bug65, available in
/pub/FDIV/individual.bugs on boole.stanford.edu. The above example
comes from bug66.

For starters let me just read it out loud. "At cycle 20 (= line 1) the
database is only mildly corrupted, with a very small scattering of
errors across much of the spectrum, none greater than 10^-5 though it
contains at least one relative error greater than 10^-6. At least 98%
of the database is intact in the sense of containing no relative errors
larger than 10^-13."

Now here's how I got that out of the above.

Each nonleading dot in a run of dots represents about 2% of the
database entries (100/48 to be exact), arranged in order of increasing
accuracy (decreasing relative error) from left to right. (2% of
750,000 entries is 15,000 errors.) The leading dot in each run denotes
less, namely anywhere from a single entry to 2% of the database.

Each hex digit (0-9a-d) marks an error size boundary; digit 8 is the
boundary for relative errors of size about 10^-8. (Note that since
digits take up space they displace dots, which therefore move around a
little as an artifact of this method of display, depending on where the
digits are concentrated.) Adjacent hex digits, e.g. 23, indicate
complete absence of errors in that decade, namely between 10^-2 and 10^-3.

So in the line above, the most erroneous entry is between 10^-5 and
10^-6, and it is at least consistent with these rules, though very
unlikely, that there are only 8 entries to the left of d (i.e. less
accurate than 10^-13). The other extreme would seem to be that 16% of
the database is left of d. This seems like pretty crummy notation, and
it was really only meant for eyeballing the general effect, but there
is however a trick to reading more into this. There were 48
non-leading dots in the above line in my original message (you deleted
ten), all to the right of d, but that accounts for the whole database
to within a dot. Hence less than 2% of the entries can be left of the
d.

The number at the start of the line is the cycle number divided by 20
(more recently I changed the program to not divide by 20). (A cycle is
the number of steps required to refresh the database with as many
numbers as fit in the database; Mercedes claims their climate control
does this to the car's air in 20 seconds, but they certainly don't
claim to have replaced every molecule in that time, or even in 20
minutes = 60 cycles.) So the example is a snapshot at the end of cycle
20.

At cycle 280 of the same run we see

14 0123456789a.b.c.d................................................

which indicates that all errors above 10^-10 have been swept out of the
database, mostly by the incoming 1's constituting 1/3 of the incoming
numbers (computing x/x will also do this but with very low probability;
x-x will not do it at all because 0's are discarded along with other
quantities below 2^-32). The source of errors in that run were
double-precision truncations, measured using extended precision as the
standard, and they came from the arithmetic operations supplying 2/3's
of the incoming numbers. Though tiny they were enormous in quantity
compared to FDIV errors, giving them enough "momentum" to sweep down to
10^-10 by being magnified by arithmetic before the diluting effect of
the 1's was finally able to sweep them all away. But this is a chaotic
process and some errors would from time to time reach 10^-7, though
after cycle 120 none got past that threshold (at least none were there
at each 20-cycle snapshot).

I learned a lot about doing experiments from this experiment, but the
most useful thing I learned about the Pentium itself was I think the
fact that if you start from low precision data (and it doesn't seem to
matter too much whether it is exclusively the number 1, or random
decimal numbers of the form d.dd) and compute at random while
respecting type (don't add kumquats to square kumquats) and keeping
dependency chains short (< 100), then the kind of numbers you end up
with, when selected among at random and fed into FDIV, trigger the bug
every few million divisions. This is three orders of magnitude higher
than for random data in Intel's sense, operands selected from the 2^23
mantissas with equal probability. I'd like to measure this more
carefully and under a greater variety of conditions just to be sure,
but I certainly did not have to wait 24 days for an FDIV error as the
IBM report suggests, they popped up every couple of minutes or so!
27,000 years? Give me a break.

James Finn

Jan 2, 1995, 10:51:26 PM
Phil Payne (Ph...@sievers.demon.co.uk) wrote:
> It has been pointed out by others that optimising compilers will recognise
> expressions with only constants as input and compile the result of evaluating
> the expression, not the expression itself.

In C, this will always be done, whether or not you are running
an optimizer. C allows "constant expressions" (e.g., 7.2 + 8.3)
pretty much anyplace that a constant is required. For example,
you can declare an array with

#define STRSIZE 80
char s[STRSIZE + 1];

and the compiler is expected to evaluate the constant expression
at compile time. Optimizers may compound the problem, but avoiding
them will not solve it.
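
The same goes for floating point: an initializer such as

    static const double third = 4.999999 / 14.999999;

is a constant expression (the operands are the bad pair mentioned
elsewhere in this thread), and a compiler is entitled to evaluate it
at translation time with no optimizer involved at all.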

--James

David Seal

Jan 3, 1995, 8:28:28 AM
pr...@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:

>Other experiments suggest themselves here, such as greatly diluting the
>1's to see what happens when their mopping-up or moderating influence
>is removed; those few but large errors mentioned just above might make
>a much bigger nuisance of themselves then. However a more meaningful
>experiment than simply removing subtraction might be to allow it but
>put some limit on how small the difference can get relative to the
>operands, e.g. .001 or .5. This could be taken as a crude model of the
>improved stability obtained for ill-conditioned data by careful
>pivoting, Householder triangulations, etc. I would guess that the .001
>limit would not make a big difference but that .5 would not be terribly
>different from no subtraction at all with regard to rate of meltdown.
>Too many fun experiments possible, too little time.

One that suggests itself to me: run three versions in parallel, one
for buggy Pentium, one for correct double precision and one for
correct extended precision. Prune a calculation out of the database if
the inaccuracy of the correct double precision result relative to the
correct extended precision result becomes too large.

The rationale behind this: it is an attempt to model the (somewhat
questionable) assumption that the programs people run have been
designed to be well-behaved in the presence of double precision
rounding errors: as long as double precision remains accurate relative
to extended precision, the calculation is likely to be one that you
could reasonably expect to come up in people's programs. (Obviously,
this isn't completely true: sometimes it will accept a numerically
unstable calculation just because it happens not to produce a bad
error for the particular operand values it receives. But it should
prune out most of the unstable calculations.) Observing what
additional errors are produced by the Pentium bug in an environment in
which ordinary rounding errors are kept under control might give a
better understanding of what the effects of the Pentium bug are likely
to be in the real world.
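
In code, the pruning test itself might look something like the
following sketch (the function name and the 1e-9 cutoff are arbitrary;
it assumes each entry is kept in lockstep as an extended precision
reference, a correct double, and the flawed-Pentium double, the last
of which is stored but never consulted by the test):

#include <math.h>

int accept_entry(long double ext, double dbl)
{
    long double rel = (ext == 0.0L)? 0.0L:
        fabsl((long double)dbl - ext)/fabsl(ext);
    return rel < 1e-9L;   /* keep only while double stays close to extended */
}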

Sorry about the fact that I'm only suggesting an experiment, not doing
it and reporting the results: unfortunately (or should I say
fortunately?), I don't have a Pentium to do the experiments on!

David Seal
ds...@armltd.co.uk

Vaughan R. Pratt

Jan 3, 1995, 1:15:14 PM
In article <3ebjds$d...@doc.armltd.co.uk>,
David Seal <ds...@armltd.co.uk> wrote:

>pr...@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:
>
>>Other experiments suggest themselves here, such as greatly diluting the
>>...
>>Too many fun experiments possible, too little time.
>
>One that suggests itself to me: run three versions in parallel, one
>for buggy Pentium, one for correct double precision and one for
>correct extended precision. Prune a calculation out of the database if
>the inaccuracy of the correct double precision result relative to the
>correct extended precision result becomes too large.
>
>The rationale behind this: it is an attempt to model the (somewhat
>questionable) assumption that the programs people run have been
>designed to be well-behaved in the presence of double precision
>rounding errors: as long as double precision remains accurate relative
>to extended precision, the calculation is likely to be one that you
>could reasonably expect to come up in people's programs.

A roughly equivalent effect turned out to be achieved by keeping
dependency chains below 100. Above 200 the double-precision-truncation
errors were blown up to the "danger region" of relative errors above
10^-5. (Basis: this experiment is intended to factor in considerations
of error propagation, and not too many engineering applications depend
directly on the 6th decimal digit.)

Pruning only those computations that actually create these errors,
rather than the more draconian measure of pruning all that are at any
risk of them due to much arithmetic, may possibly give different
results. I'm not sure which corresponds more closely to actual
computational practice; I suppose one could interpret careful pivoting
as being of the former kind, but other computations avoid this sort of
magnification by avoiding excessive chaining of dependencies.

David Seal

Jan 5, 1995, 10:13:41 AM

I'm a bit worried by the technique of restricting dependency chains to
be no more than 100 in length (or any other arbitrary figure) because
I believe that such a limit automatically means that you will conclude
that "chain reactions" and "meltdown" cannot occur. Basically, for
either of them to occur, a Pentium error must have the possibility of
perturbing infinitely many future calculations: the maximum dependency
chain length together with random replacement of the database items
means that, with probability 1, this doesn't happen.

Limiting dependency chain lengths is certainly a useful tool in
practice to stop rounding errors growing into significant errors in
engineering terms, but it is certainly neither sufficient (e.g. the
calculation ((1+1+1+1)/(1+1+1) - 1/(1+1+1)) - 1 has a short dependency
chain length, but a very large relative error due to excessive
cancellation) nor necessary (a "converging approximations" calculation
may have an arbitrarily long dependency chain without becoming
excessively inaccurate, due to the fact that each step reduces the
errors due to previous steps).

The real restriction on the calculations people actually do is that
they should not suffer excessive errors, rather than that they should
conform to any single technique for achieving this goal. This was the
rationale behind my suggestion of comparing double and extended
precision calculations to determine whether the calculation should be
pruned. As I said, it isn't perfect; it should however allow the
indefinitely long dependency chains which are necessary to make a
"chain reaction" possible, while at the same time excluding some quite
short dependency chain calculations that no-one in their right mind
would perform in practice.

David Seal
ds...@armltd.co.uk
