
Ariane 5 failure


@@ robin

Sep 25, 1996

agr...@netcom.com (Amara Graps) writes:

>I read the following message from my co-workers that I thought was
>interesting. So I'm forwarding it to here.

>(begin quote)
>Ariane 5 failure was attributed to a faulty DOUBLE -> INT conversion
>(as the proximate cause) in some ADA code in the inertial guidance
>system. Diagnostic error messages from the (faulty) inertial guidance
>system software were interpreted by the steering system as valid data.

>English text of the inquiry board's findings is at
> http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
>(end quote)

>Amara Graps email: agr...@netcom.com
>Computational Physics vita: finger agr...@best.com

There's a little more to it . . .

The unchecked data conversion in the Ada program resulted
in the shutdown of the computer. The backup computer had
already shut down a whisker of a second before. Consequently,
the on-board computer was unable to switch to the backup, and
used the error codes from the shut-down computer as
flight data.
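
To make the failure mode concrete, here's a minimal Ada sketch (the
names and the value are invented; the real SRI code is not public) of
an unprotected conversion of the kind the report describes:

   procedure Convert_Bias is
      type Int_16 is range -2**15 .. 2**15 - 1;
      Horizontal_Bias : constant Float := 40_000.0;  -- invented value,
                                                     -- beyond 16-bit range
      BH : Int_16;
   begin
      -- Unprotected conversion: raises Constraint_Error at run time,
      -- because 40_000.0 exceeds Int_16'Last (32_767). Left unhandled,
      -- this is the kind of exception that shut the computer down.
      BH := Int_16 (Horizontal_Bias);
   end Convert_Bias;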

This is not the first time that such a programming error
(integer out of range) has occurred.

In 1981, the manned STS-2 was preparing to take off, but because
some fuel was accidentally spilt and some tiles accidentally
dislodged, takeoff was delayed by a month.

During that time, the astronauts decided to get in some
more practice with the simulator.

During a simulated descent, the 4 computing systems (the main
and the 3 backups) got stuck in a loop, with the complete
loss of control.

The cause? An integer out of range -- the same problem
as with Ariane 5.

In the STS-2 case, the precise cause was a computed GOTO
with a bad index (similar to a CASE statement without
an OTHERWISE clause).
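
In Ada terms, the fix for that failure class is forced on you by the
language (a made-up sketch, not the shuttle's actual code):

   procedure Dispatch (Phase : in Integer) is
   begin
      -- Ada requires the alternatives to cover every possible value;
      -- "when others" plays the role of the missing OTHERWISE clause.
      case Phase is
         when 1      => null;  -- descent step (placeholder)
         when 2      => null;  -- landing step (placeholder)
         when others => null;  -- bad index: recover instead of looping
      end case;
   end Dispatch;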

In both cases, the programming error could have been detected
with a simple test, but in both cases, no test was included.

One would have thought that, having had at least one failure
from an integer out of range, the implementors of the software
for Ariane 5 would have been extra careful in ensuring that
all data conversions were within range -- since any kind
of interrupt would result in destruction of the spacecraft.

There's a case for a review of the programming language used.

Michel OLAGNON

Sep 25, 1996

In article <52a572$9...@goanna.cs.rmit.edu.au>, r...@goanna.cs.rmit.edu.au (@@ robin) writes:
>[reports of Ariane and STS-2 bugs deleted]

>
>
>In both cases, the programing error could have been detected
>with a simple test, but in both cases, no test was included.
>
>One would have thought that having had one failure (at least)
>for integer out-of-range, that the implementors of the software
>for Ariane 5 would have been extra careful in ensuring that
>all data conversions were within range -- since any kind
>of interrupt would result in destruction of the spacecraft.
>

Maybe the main reason for the lack of testing and care was
that the conversion exception could only occur after lift-off,
and that that particular piece of program was of no use after
lift-off. It was only kept running for 50 s in order to
speed up countdown restart in case of an interruption between
H0-9 and H0-5.

Conclusion: Never compute values that are of no use when you can
avoid it !

>There's a case for a review of the programming language used.


Michel
--
| Michel OLAGNON email : Michel....@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|


Byron Kauffman

Sep 25, 1996

Michel OLAGNON wrote:
>
> Maybe the main reason for the lack of testing and care was
> that the conversion exception could only occur after lift-off,
> and that that particular piece of program was of no use after
> lift-off. It was only kept running for 50 s in order to
> speed up countdown restart in case of an interruption between
> H0-9 and H0-5.
>
> Conclusion: Never compute values that are of no use when you can
> avoid it !
>
> >There's a case for a review of the programming language used.
>
> Michel
> --
> | Michel OLAGNON email : Michel....@ifremer.fr|
> | IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|

Of course, Michel, you've got a great point, but let me give you some
advice, assuming you haven't read this thread for the last few months
(seems like years). Robin's whole point is that he firmly believes that
the problem would not have occurred if PL/I had been used instead of
Ada. Several EXTREMELY competent and experienced engineers who actually
have written flight-control software have patiently, and in some cases
(though I can't blame them) impatiently, attempted to explain the
situation - that this was a bad design/management decision combined with
a fatal oversight in testing - to this poor student, but alas, to no
avail.

My advice, Michel - blow it off and don't let ++robin (or is it
@@robin?) get to you, because "++robin" is actually an alias for John
Cleese. He's gathering material for a sequel to "The Argument
Sketch"... :-)

A. Grant

Sep 25, 1996

In article <32492E...@lmtas.lmco.com> Byron Kauffman <Kauff...@lmtas.lmco.com> writes:
>Several EXTREMELY competent and experienced engineers who actually have
>written flight-control software have patiently, and in some cases
>(though I can't blame them) impatiently attempted to explain the
>situation - that this was a bad design/management decision combined with
>a fatal oversight in testing - to this poor student, but alas, to no
>avail.

Robin is not a student. He is a senior lecturer at the Royal
Melbourne Institute of Technology, a highly reputable institution.

Bob Kitzberger

Sep 25, 1996

@@ robin (r...@goanna.cs.rmit.edu.au) wrote:
: The cause? An integer out of range -- the same problem

: as with Ariane 5, where an integer became out of range.
...
: There's a case for a review of the programming language used.

Why do you persist?

Ada _has_ range checks built into the language. They were explicitly
disabled in this case.
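
For the record, disabling them is a deliberate, visible act in the
source (a minimal sketch):

   procedure Fast_Path is
      pragma Suppress (Range_Check);  -- permits the compiler to omit
                                      -- range checks in this scope
      type Percent is range 0 .. 100;
      P : Percent := 0;
   begin
      P := P + 1;  -- with the pragma, the result may go unchecked
   end Fast_Path;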

What are you failing to grasp?

--
Bob Kitzberger Rational Software Corporation r...@rational.com
http://www.rational.com http://www.rational.com/pst/products/testmate.html

Chris Morgan

Sep 25, 1996

In article <ag129.804...@ucs.cam.ac.uk> ag...@ucs.cam.ac.uk
(A. Grant) writes:

>Robin is not a student. He is a senior lecturer at the Royal
>Melbourne Institute of Technology, a highly reputable institution.

I'm tempted to say "not so reputable to readers of this newsgroup"
after the ridiculous statements made by Robin w.r.t. Ariane 5 but
Richard A. O'Keefe's regular excellent postings more than balance them
out.

Chris
--
Chris Morgan |email c...@mihalis.demon.co.uk (home)
http://www.mihalis.demon.co.uk/ | or chris....@baesema.co.uk (work)

Ken Garlington

Sep 25, 1996

A. Grant wrote:
> Robin is not a student. He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

When it comes to building embedded safety-critical systems, trust me:
He's a student!

--
LMTAS - "Our Brand Means Quality"

Ronald Kunne

Sep 26, 1996

In article <52bm1c$g...@rational.rational.com>

r...@rational.com (Bob Kitzberger) writes:

>Ada _has_ range checks built into the language. They were explicitly
>disabled in this case.

The problem of constructing bug-free real-time software seems to me
a trade-off between safety and speed of execution (and maybe available
memory?). In other words: including tests on array boundaries might
make the code safer, but also slower.

Comments?

Greetings,
Ronald

Byron Kauffman

Sep 26, 1996

A. Grant wrote:
>
> In article <32492E...@lmtas.lmco.com> Byron Kauffman <Kauff...@lmtas.lmco.com> writes:
> >Several EXTREMELY competent and experienced engineers who actually have
> >written flight-control software have patiently, and in some cases
> >(though I can't blame them) impatiently attempted to explain the
> >situation - that this was a bad design/management decision combined with
> >a fatal oversight in testing - to this poor student, but alas, to no
> >avail.
>
> Robin is not a student. He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

A. -

Thank you for confirming my long-held theory that those who inhabit
the ivory towers of engineering/CS academia should spend 2 of every 5
years working at a real job out in the real world. My intent is not to
slam professors who are in touch with reality, of course (e.g.,
Feldman, Dewar, et al), but the idealistic theoretical side often is a
far cry from the practical, just-get-it-done world we have to deal
with once we're out of school.


I just KNOW there's a good Dilbert strip here somewhere...

Sandy McPherson

Sep 26, 1996

A. Grant wrote:
>
> Robin is not a student. He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

Why doesn't he wise up and act like one then?

I don't know the man, and I suspect he has been winding everybody up
just for a laugh. But, if this is not the case, the thought of such a
closed mind teaching students is quite horrific.

"Use PL/I mate, you'll be tucker",

--
Sandy McPherson MBCS CEng. tel: +31 71 565 4288 (w)
ESTEC/WAS
P.O. Box 299
NL-2200AG Noordwijk

Matthew Heaney

Sep 26, 1996

In article <1780E84...@frcpn11.in2p3.fr>, KU...@frcpn11.in2p3.fr
(Ronald Kunne) wrote:

Why, yes. If the rocket blows up, at the cost of millions of dollars, then
I'm not clear what the value of "faster execution" is. The rocket's gone,
so what difference does it make how fast the code executed? If you left
the range checks in, your code would be *marginally* slower, but you'd
still have your rocket, now wouldn't you?

>Ronald

Matt

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mhe...@ni.net
(818) 985-1271

Wayne Hayes

Sep 27, 1996

In article <mheaney-ya0231800...@news.ni.net>,

Matthew Heaney <mhe...@ni.net> wrote:
>Why, yes. If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is. The rocket's gone,
>so what difference does it make how fast the code executed? If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?

Your point is moot. In this case, catching the error wouldn't have
helped. The out-of-bounds error happened in a piece of code designed
for the Ariane-4, in which it was *physically impossible* for the value
to overflow (the Ariane-4 didn't go that fast, and it was a velocity
variable). Then the code was used, as-is, in the Ariane-5, without an
analysis of how the code would react in the new hardware, which flew
faster. Had the analysis been done, they wouldn't have added bounds
checking, they would have modified the code to actually *work*, because
they would have realized that the code was *guaranteed* to fail on the
first flight.

--
"And a woman needs a man... || Wayne Hayes, wa...@cs.utoronto.ca
like a fish needs a bicycle..." || Astrophysics & Computer Science
-- U2 (apparently quoting Gloria Steinem?) || http://www.cs.utoronto.ca/~wayne

Alan Brain

Sep 27, 1996

Ronald Kunne wrote:

> The problem of constructing bug-free real-time software seems to me
> a trade-off between safety and speed of execution (and maybe available
> memory?). In other words: including tests on array boundaries might
> make the code safer, but also slower.
>
> Comments?

Bug-free software is not a reasonable criterion for success in a
safety-critical system, IMHO. A good program should meet the
requirements for safety etc despite bugs. Also despite hardware
failures, soft failures, and so on. A really good safety-critical
program should be remarkably difficult to de-bug, as the only way you
know it's got a major problem is by examining the error log, and
calculating that its performance is below theoretical expectations.

And if it runs too slow, many times in the real world you can spend 2
years of development time and many megabucks kludging the software, or
wait 12 months and get the new 400 MHz chip instead of your current 133.

---------------------- <> <> How doth the little Crocodile
| Alan & Carmel Brain| xxxxx Improve his shining tail?
| Canberra Australia | xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo oo oo oo
By pulling Maerklin Wagons, in 1/220 Scale

Ronald Kunne

Sep 27, 1996

In article <mheaney-ya0231800...@news.ni.net>

mhe...@ni.net (Matthew Heaney) writes:

>>The problem of constructing bug-free real-time software seems to me
>>a trade-off between safety and speed of execution (and maybe available
>>memory?). In other words: including tests on array boundaries might
>>make the code safer, but also slower.

>Why, yes. If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is. The rocket's gone,
>so what difference does it make how fast the code executed? If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?

Despite the sarcasm, I will elaborate.

Suppose an array goes from 0 to 100, and the calculated index is known
not to go outside this range. Why would one insist on putting the
range test in, which will slow down the code? This might be a problem
if the particular piece of code is heavily used, and the code executes
too slowly otherwise. "Marginally slower" if it happens only once, but
such checks on indices and function arguments (like square roots) are
necessary *everywhere* in code, if one is consistent.
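
(For illustration, the kind of explicit test I mean, in Ada and with
invented names:)

   procedure Store (I : in Integer; X : in Float) is
      A : array (0 .. 100) of Float := (others => 0.0);
   begin
      if I in A'Range then   -- the explicit test that costs time
         A (I) := X;
      end if;                -- else: handle the bad index somehow
   end Store;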

Actually, this was the case here: the code was taken from an Ariane 4
code where it was physically impossible that the index would go out
of range: a test would have been a waste of time.
Unfortunately this was no longer the case in the Ariane 5.

Friendly greetings,
Ronald Kunne

A. Grant

Sep 27, 1996

In article <324A7C...@lmtas.lmco.com> Byron Kauffman <Kauff...@lmtas.lmco.com> writes:
>A. Grant wrote:
>> Robin is not a student. He is a senior lecturer at the Royal
>> Melbourne Institute of Technology, a highly reputable institution.

>Thank you for confirming my long-held theory that those who inhabit the


>ivory towers of engineering/CS academia should spend 2 of every 5 years
>working at a real job out in the real world. My intent is not to slam
>professors who are in touch with reality, of course (e.g., Feldman,
>Dewar, et al), but the idealistic theoretical side often is a far cry
>from the practical, just-get-it-done world we have to deal with once
>we're out of school.

You're being a bit hard on theoretical computer scientists here.
Just because it's called computer science doesn't mean it has to be
able to instantly make money on real computers. And the Ariane 5
failure was due to pragmatism (reusing old stuff to save money)
not idealism (applying theoretical proofs of correctness).

But in any case RMIT is noted for its involvement with industry.
(I used to work for a start-up company out of RMIT premises.)
If PL/I is being pushed by RMIT it's probably because the DP
managers in Collins St. want it. Australia doesn't have much call
for aerospace systems.

Ken Garlington

Sep 27, 1996

Ronald Kunne wrote:
>
> In article <52bm1c$g...@rational.rational.com>
> r...@rational.com (Bob Kitzberger) writes:
>
> >Ada _has_ range checks built into the language. They were explicitly
> >disabled in this case.
>
> The problem of constructing bug-free real-time software seems to me
> a trade-off between safety and speed of execution (and maybe available
> memory?). In other words: including tests on array boundaries might
> make the code safer, but also slower.

Particularly for fail-operate systems that must continue to function in
harsh environments, memory and throughput can be tight. This usually happens
because the system must continue to operate on emergency power and/or
cooling. At least until recently, the processing systems that had lots of
memory and CPU power also had larger power and cooling requirements, so they
couldn't always be used in this class of systems. (That's changing, somewhat.) So,
the tradeoff you describe can occur.

The trade-off I find even more interesting is the safety gained from
adding extra features vs. the safety _lost_ by adding those features. Every
time you add a check, whether it's an explicit check or one automatically
generated by the compiler, you have to have some way to gain confidence that
the check will not only work, but won't create some side-effect that causes
a different problem. The effort expended to get confidence for that additional
feature is effort that can't be spent gaining assurance of other features in
the system, assuming finite resources. There is no magic formula I've ever
seen to make that trade-off - ultimately, it's human judgement.

John McCabe

Sep 27, 1996

r...@goanna.cs.rmit.edu.au (@@ robin) wrote:

<..snip..>

Just a point for your information. From clari.tw.space:

"An inquiry board investigating the explosion concluded in
July that the failure was caused by software design errors in a
guidance system."

Note software DESIGN errors - not programming errors.

Best Regards
John McCabe <jo...@assen.demon.co.uk>


Lawrence Foard

Sep 27, 1996

Ronald Kunne wrote:
>
> Actually, this was the case here: the code was taken from an Ariane 4
> code where it was physically impossible that the index would go out
> of range: a test would have been a waste of time.
> Unfortunately this was no longer the case in the Ariane 5.

Actually it would still present a danger on Ariane 4. If the sensor
which apparently was no longer needed during flight became defective,
then you could get a value out of range.

--
The virgin birth of Pythagoras via Apollo. The martyrdom of
St. Socrates. The Gospel according to Iamblichus.
-- Have an 18.9cents/minute 6 second billed calling card tomorrow --
http://www.vwis.com/cards.html


Ken Garlington

Sep 28, 1996

Ronald Kunne wrote:
>
> In article <mheaney-ya0231800...@news.ni.net>
> mhe...@ni.net (Matthew Heaney) writes:
>
> >>The problem of constructing bug-free real-time software seems to me
> >>a trade-off between safety and speed of execution (and maybe available
> >>memory?). In other words: including tests on array boundaries might
> >>make the code safer, but also slower.
>
> >Why, yes. If the rocket blows up, at the cost of millions of dollars, then
> >I'm not clear what the value of "faster execution" is. The rocket's gone,
> >so what difference does it make how fast the code executed? If you left
> >the range checks in, your code would be *marginally* slower, but you'd
> >still have your rocket, now wouldn't you?
>
> Despite the sarcasm, I will elaborate.
>
> Suppose an array goes from 0 to 100, and the calculated index is known
> not to go outside this range. Why would one insist on putting the
> range test in, which will slow down the code? This might be a problem
> if the particular piece of code is heavily used, and the code executes
> too slowly otherwise. "Marginally slower" if it happens only once, but
> such checks on indices and function arguments (like square roots) are
> necessary *everywhere* in code, if one is consistent.

I might agree with the conclusion, but probably not with the argument.
If the array is statically typed to go from 0 to 100, and everything
that indexes it is statically typed for that range or smaller, most
modern Ada compilers won't generate _any_ code for the check.
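
For instance (a sketch; what code actually gets emitted varies by
compiler, of course):

   procedure Static_Index is
      type Index is range 0 .. 100;
      Table : array (Index) of Float := (others => 0.0);
      I     : Index := 42;
   begin
      -- I can only hold 0 .. 100, so the compiler can prove the
      -- subscript is in range and emit no check here at all.
      Table (I) := 1.0;
   end Static_Index;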

I still believe the more interesting issue has to do with the _consequences_
of the check. If your environment doesn't lend itself to a reasonable response
to the check (quite possible in fail-operate systems inside systems that move
really fast), and you have to test the checks to make sure they don't _create_
a problem, then you've got a hard decision on your hands: suppress the check
(which might trigger a compiler bug or some other problems), or leave the check in
(which might introduce a problem, or divert your attention away from some other
problem).

Ken Garlington

Sep 28, 1996

Alan Brain wrote:

>
> Ronald Kunne wrote:
>
> > The problem of constructing bug-free real-time software seems to me
> > a trade-off between safety and speed of execution (and maybe available
> > memory?). In other words: including tests on array boundaries might
> > make the code safer, but also slower.
> >
> > Comments?
>
> Bug-free software is not a reasonable criterion for success in a
> safety-critical system, IMHO. A good program should meet the
> requirements for safety etc despite bugs.

An OK statement for a fail-safe system. How do you propose to implement
this theory for a fail-operate system, particularly if there are system
constraints on weight, etc. that preclude hardware backups?

> Also despite hardware
> failures, soft failures, and so on.

A system which will always meet its requirements despite any combination
of failures is in the same regime as the perpetual motion system. If
you build one, you'll probably make a lot of money, so go to it!

> A really good safety-critical
> program should be remarkably difficult to de-bug, as the only way you
> know it's got a major problem is by examining the error log, and
> calculating that it's performance is below theoretical expectations.
> And if it runs too slow, many times in the real-world you can spend 2
> years of development time and many megabucks kludging the software, or
> wait 12 months and get the new 400 Mhz chip instead of your current 133.

I really need to change jobs. It sounds so much simpler to build
software for ground-based PCs, where you don't have to worry about the
weight, power requirements, heat dissipation, physical size,
vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.

Alan Brain

Sep 29, 1996

Ronald Kunne wrote:

> Suppose an array goes from 0 to 100, and the calculated index is known
> not to go outside this range. Why would one insist on putting the
> range test in, which will slow down the code? This might be a problem
> if the particular piece of code is heavily used, and the code executes
> too slowly otherwise. "Marginally slower" if it happens only once, but
> such checks on indices and function arguments (like square roots) are
> necessary *everywhere* in code, if one is consistent.

Why insist?
1. Suppressing all checks in Ada-83 makes about a 5% difference in
execution speed, in typical real-time and avionics systems (for
example, B2 simulator, CSU-90 sonar, COSYS-200 combat system). If your
hardware budget is this tight, you'd better not have lives at risk, or
a lot of money, as technical risk is appallingly high.

2. If you know the range is 0-100, and you get 101, what does this show?
a) A bug in the code (99.9999....% probable). b) A hardware fault. c) A
soft failure, as in a stray cosmic ray zapping a bit. d) A faulty
analysis of your "can't happen" situation. As in re-use, or where your
array comes from an IO channel with noise on....

Type a) and d) failures should be caught during testing. Most of them.
OK, some of them. Range checking here is a necessary debugging aid. But
type b) and c) can happen too out in the real world, and if you don't
test for an error early, you often can't recover the situation. Lives or
$ lost.

Brain's law:
"Software Bugs and Hardware Faults are no excuse for the Program not to
work".

So: it costs peanuts, and may save your hide.

Louis K. Scheffer

Sep 29, 1996

KU...@frcpn11.in2p3.fr (Ronald Kunne) writes:

>The problem of constructing bug-free real-time software seems to me
>a trade-off between safety and speed of execution (and maybe available
>memory?). In other words: including tests on array boundaries might
>make the code safer, but also slower.
>
>Comments?

True in this case, but not in the way you might expect. The software group
decided that they wanted the guidance computers to be no more than 80 percent
busy. Range checking ALL the variables took too much time, so they analyzed
the situation and only checked those that might overflow. In the Ariane 4,
this particular variable could not overflow unless the trajectory was wildly
off, so they left out the range checking.

I think you could make a good case for range checking in the Ariane
software making it less safe, rather than more safe. The only reason they
check for overflow is to find hardware errors - since the software is designed
to not overflow, then any overflow must be because of a hardware problem, so
if any processor detects an overflow it shuts down. So on the one hand, each
additional range check increases the odds of catching a hardware error before
it does damage, but increases the odds that a processor shuts down while it
could still be delivering useful data. (Say the overflow occurs while
computing unimportant results, as on the Ariane 5). Given the relative
odds of hardware and software errors, it's not at all obvious to me that
range checking helps at all in this case!

The real problem is that they did not re-examine this software for the
Ariane 5. If they had either simulated it, or examined it closely, they
would probably have found this problem.
-Lou Scheffer

Robert A Duff

Sep 29, 1996

In article <324F11...@dynamite.com.au>,

Alan Brain <aeb...@dynamite.com.au> wrote:
>Brain's law:
>"Software Bugs and Hardware Faults are no excuse for the Program not to
>work".
>
>So: it costs peanuts, and may save your hide.

This reasoning doesn't sound right to me. The hardware part, I mean.
The reason checks-on costs only 5% or so is that compilers aggressively
optimize out almost all of the checks. When the compiler proves that a
check can't fail, it assumes that the hardware is perfect. So, hardware
faults and cosmic rays and so forth are just as likely to destroy the
RTS, or cause the program to take a wild jump, or destroy the call
stack, or whatever -- as opposed to getting a Constraint_Error and
recovering gracefully. After all, the compiler doesn't range-check the
return address just before doing a return instruction!

- Bob

Wayne L. Beavers

Sep 30, 1996

I have been reading this thread awhile and one topic that I have not seen mentioned is protecting the code
area from damage. When I code in PL/I or any other reentrant language I always make sure that the executable
code is executing from read-only storage. There is no way to put the data areas in read-only storage
(obviously) but I can't think of any reason to put the executable code in writeable storage.

I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable code, from one system to another. The
single most common error I had to correct was incorrect usage of pointer variables. I caught a lot of them
whenever they attempted to accidentally store into the code area. At that point it is trivial to correct the
bug. This technique certainly doesn't catch all pointer failures, but it will catch at least some of them.

Michael Dworetsky

Oct 1, 1996

In article <84384503...@assen.demon.co.uk> jo...@assen.demon.co.uk (John McCabe) writes:
>r...@goanna.cs.rmit.edu.au (@@ robin) wrote:
>
><..snip..>
>
>Just a point for your information. From clari.tw.space:
>
> "An inquiry board investigating the explosion concluded in
>July that the failure was caused by software design errors in a
>guidance system."
>
>Note software DESIGN errors - not programming errors.
>

Indeed, the problems were in the specifications given to the programmers,
not in the coding activity itself. They wrote exactly what they were
asked to write, as far as I could see from reading the report summary.

The problem was caused by using software developed for Ariane 4's flight
characteristics, which were different from those of Ariane 5. When the
launch vehicle exceeded the boundary parameters of the Ariane-4 software,
it sent an error message and, as specified by the remit given to
programmers, a critical guidance system shut down in mid-flight. Ka-boom.


--
Mike Dworetsky, Department of Physics | Haiku: Nine men ogle gnats
& Astronomy, University College London | all lit
Gower Street, London WC1E 6BT UK | till last angel gone.
email: m...@star.ucl.ac.uk | Men in Ukiah.


Ken Garlington

Oct 1, 1996

Wayne L. Beavers wrote:
>
> I have been reading this thread awhile and one topic that I have not seen mentioned is protecting the code
> area from damage. When I code in PL/I or any other reentrant language I always make sure that the executable
> code is executing from read-only storage. There is no way to put the data areas in read-only storage
> (obviously) but I can't think of any reason to put the executable code in writeable storage.

That's actually a pretty common rule of thumb for safety-critical systems.
Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
can cause a random change in the memory. So, it's not a perfect fix.

>
> I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable code, from one system to another. The
> single most common error I had to correct was incorrect usage of pointer variables. I caught a lot of them
> whenever they attempted to accidentally store into the code area. At that point it is trivial to correct the
> bug. This technique certainly doesn't catch all pointer failures, but it will catch at least some of them.


Ken Garlington

Oct 1, 1996

Alan Brain wrote:
>
> 1. Suppressing all checks in Ada-83 makes about a 5% difference in
> execution speed, in typical real-time and avionics systems. (For
> example, B2 simulator, CSU-90 sonar, COSYS-200 Combat system). If your
> hardware budget is this tight,
> you'd better not have lives at risk, or a lot of money, as technical
> risk is
> appallingly high.

Actually, I've seen systems where checks make much more than a 5% difference.
For example, in a flight control system, checks done in the redundancy
management monitor (comparing many redundant inputs in a tight loop) can
easily add 10% or more.

I have also seen flight-critical systems where 5% is a big deal, and where you
can _not_ add a more powerful processor to fix the problem. Flight control
software usually exists in a flight control _system_, with system issues of
power, cooling, space, etc. to consider. On a missile, these are important
issues. You might consider the technical risk "appalingly high," but the fix
for that risk can introduce equally dangerous risks in other areas.

> 2. If you know the range is 0-100, and you get 101, what does this show?
> a) A bug in the code (99.9999....% probable). b) A hardware fault. c) A
> soft failure, as in a stray cosmic ray zapping a bit. d) a faulty
> analysis of your "can't happen" situation. As in re-use, or where your
> array comes from an IO channel with noise on....

You forgot (e) - a failure in the inputs. The range may be calculated,
directly or indirectly, from an input to the system. In practice, at least
for the systems I'm familiar with, that's usually where the error came
from -- either a connector fell off, or some wiring shorted out, or a bird
strike took out half of your sensors. I definitely would say that, when we
have a failure reported in operation, it's not usually because of a bug in
the software for our systems!

> Type a) and d) failures should be caught during testing. Most of them.
> OK, some of them. Range checking here is a neccessary debugging aid. But
> type b) and c) can happen too out in the real world, and if you don't
> test for an error early, you often can't recover the situation. Lives or
> $ lost.
>

> Brain's law:
> "Software Bugs and Hardware Faults are no excuse for the Program not to
> work".

Too bad that law can't be enforced :)

Wayne L. Beavers

Oct 1, 1996

Ken Garlington wrote:

> That's actually a pretty common rule of thumb for safety-critical systems.
> Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> can cause a random change in the memory. So, it's not a perfect fix.

You're right, but the risk and probability of memory failures is pretty low I would think. I have never seen
or heard of a memory failure in any of the systems that I have worked on. I don't know what the current
technology is, but I can remember quite awhile ago that at least one vendor was claiming that ALL double-bit
memory errors were fully detectable and recoverable, and ALL triple-bit errors were detectable but only some
were correctable. But I also don't work on realtime systems; my experience is with commercial systems.

Are you referring to on-board systems for aircraft, where weight and vibration are also factors, or are you
referring to ground-based systems that don't have similar constraints?

Does anyone know just how good memory ECC is these days?

Wayne L. Beavers way...@beyond-software.com
Beyond Software, Inc.
The Mainframe/Internet Company
http://www.beyond-software.com/

Ken Garlington

Oct 1, 1996

Wayne L. Beavers wrote:
>
> Ken Garlington wrote:
>
> > That's actually a pretty common rule of thumb for safety-critical systems.
> > Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> > can cause a random change in the memory. So, it's not a perfect fix.
>
> Your right, but the risk and probability of memory failures is pretty low I would think. I have never seen
> or heard of a memory failure in any of the systems that I have worked on. I don't know what the current
> technology is but I can remember quite awhile ago that at least one vendor was claiming that ALL double bit
> memory errors were fully detectable and recoverable, ALL triple bit errors were detectable but only some were
> correctable. But I also don't work on realtime systems, my experience is with commercial systems.
>
> Are you refering to on-board systems for aircraft where weight and vibration are also a factor or are you
> refering to ground base systems that don't have similar constraints?

On-board systems. The failure _rate_ is usually pretty low, but in a harsh environment
you can get quite a few failure _sources_, including mechanical failures (stress
fractures, solder loss due to excessive heat, etc.), electrical failures (EMI,
lightning), and so forth. You don't have to take out the actual chip, of course: just
as bad is a failure in the address or data lines connecting the memory to the CPU. Add
a memory management unit to the mix, along with various I/O devices mapped into the
memory space, and you can get a whole slew of memory-related failure modes.

You can also get into some neat system failures. For example, some "read-only" memory
actually allows writes to the execution space in certain modes, to allow quick
reprogramming. If you have a system failure that allows writes at the wrong time,
coupled with a failure that does a write where it shouldn't...

Sandy McPherson

Oct 2, 1996

It depends upon what you mean by a memory failure. I can imagine that
the chances of your memory being trashed completely are very, very low,
and in rad-hardened systems the chance of a single-event upset (SEU) is
also low, but it has to be guarded against. I have recently been working
on a system where the specified hardware has a parity bit for each octet
of memory, so SEUs which flip bit values in the memory can be detected.
This parity check is built into the system's micro-code.
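
(The principle, sketched in Ada purely for illustration - in the real
system the check lives in micro-code, not in application code:)

   with Interfaces; use Interfaces;

   function Even_Parity (B : Unsigned_8) return Boolean is
      V : Unsigned_8 := B;
      P : Boolean    := True;  -- True = even number of 1-bits so far
   begin
      while V /= 0 loop
         P := not P;
         V := V and (V - 1);   -- clear the lowest set bit
      end loop;
      return P;  -- a single flipped bit (an SEU) changes this answer
   end Even_Parity;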

Similarly, the definition of what is and isn't read-only memory is
usually a feature of the processor and/or operating system being used. A
compiler cannot put code into read-only areas of memory unless the
processor, its micro-code, and/or OS are playing ball as well. If you are
unfortunate enough to be in this situation (are there any such systems
left?), then the only thing you can do is DIY, but the compiler can't
help you much, other than the for-use-at.

I once read an interesting definition of two types of bugs in
"transaction processing" by Gray & Reuter, Heisenbugs and Bohrbugs.

Identification of potential Heisenbugs, estimation of probability of
occurrence, impact to the system on occurrence, and appropriate recovery
procedures are part of the risk analysis. An SEU is a classic Heisenbug,
which IMO is out of scope of compiler checks, because it can result in
a valid but incorrect value for a variable and is just as likely to
occur in the code section as the data section of your application. A
complete memory failure is of course beyond the scope of the compiler.

IMO an Ada compiler's job (when used properly) is to make sure that
syntactic Bohrbugs do not enter a system and all semantic Bohrbugs get
detected at runtime (as Bohrbugs, by definition, have a fixed location
and are certain to occur under given conditions - the Ariane 5 bug was
definitely a Bohrbug). The compiler cannot do anything about Heisenbugs
(because they only have a probability of occurrence). To handle
Heisenbugs generally you need to have a detection, reporting and
handling mechanism: built using the hardware's error detection, generally
accepted software practices (e.g. duplicate storage, process-pairs) and
an application-dependent exception handling mechanism. Ada provides the
means to trap the error condition once it has been reported, but it does
not implement exception handlers for you, other than the default "I'm
gone..."; additionally, if the underlying system does not provide the
means to detect a probable error, you have to implement the means of
detecting the problem and reporting it through the Ada exception
handling yourself.
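
(A sketch of the shape I mean, with stubbed-out names - the real
detection and reporting would of course be system-specific:)

   procedure Guarded_Cycle is
      procedure Run_Cycle           is begin null; end;  -- application work (stub)
      procedure Report_Error        is begin null; end;  -- reporting path (stub)
      procedure Enter_Degraded_Mode is begin null; end;  -- fallback (stub)
   begin
      Run_Cycle;
   exception
      when others =>
         Report_Error;         -- trap it, report it through your own
         Enter_Degraded_Mode;  -- mechanism, and degrade - rather than
   end Guarded_Cycle;          -- the default "I'm gone..."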

Richard A. O'Keefe

Oct 3, 1996

"Wayne L. Beavers" <way...@beyond-software.com> writes:

>I have been reading this thread awhile and one topic that I have not
>seen mentioned is protecting the code area from damage.

I imagine that everyone else has taken this for granted.
UNIX compilers have been doing it for years, and so I believe have VMS ones.

>When I code in PL/I or any other reentrant language I always make sure
>that the executable code is executing from read-only storage.

(a) This is not something that the programmer should normally have to be
concerned with, it just happens.
(b) It cannot always be done. Run-time code generation is a practical and
important technique. (Making a page read-only after new code has been
written to it is a good idea, of course.)

>There is no way to put the data areas in read-only storage (obviously)

It may be obvious, but in important cases it isn't true.
UNIX (and I believe VMS) compilers have for years had the ability to put
_selected_ data in read-only storage. And of course it is perfectly
feasible in many operating systems (certainly UNIX and VMS) to write data
into a page and then ask the operating system to make that page read-only.
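
(The _selected_ data case in Ada terms, for what it's worth; whether
it actually lands in a read-only section is up to the compiler and the
operating system:)

   package Tables is
      type Degrees is range 0 .. 90;
      -- A constant composite object: the compiler is free to place it
      -- in read-only storage. The values are placeholders, not a real
      -- table.
      Sines : constant array (Degrees) of Float := (others => 0.0);
   end Tables;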

>but I can't think of any reason to put the executable code in writeable
>storage.

Run-time binary translation. Some approaches to relocation. How many
reasons do you want?

>I one had to port 8,000 subroutines in PL/I, 24 megabytes of executable
>code from one system to another.

In a language where the last revision of the standard was 1976?
You have my deepest sympathy.

--
Australian citizen since 14 August 1996. *Now* I can vote the xxxs out!
Richard A. O'Keefe; http://www.cs.rmit.edu.au/%7Eok; RMIT Comp.Sci.

@@ robin

Oct 4, 1996

Lawrence Foard <ent...@vwis.com> writes:

>Ronald Kunne wrote:

>> Actually, this was the case here: the code was taken from an Ariane 4
>> code where it was physically impossible that the index would go out
>> of range: a test would have been a waste of time.

---A test for overflow in a system that aborts if unexpected overflow
occurs, is never a waste of time.

Recall Murphy's Law: "If anything can go wrong, it will."
Then there's Robert's Law: "Even if it can't go wrong, it will."

>> Unfortunately this was no longer the case in the Ariane 5.

>Actually it would still present a danger on Ariane 4. If the sensor
>which apparently was no longer needed during flight became defective,
>then you could get a value out of range.

---Good point Lawrence.

@@ robin

Oct 4, 1996

jo...@assen.demon.co.uk (John McCabe) writes:

>Just a point for your information. From clari.tw.space:

> "An inquiry board investigating the explosion concluded in
>July that the failure was caused by software design errors in a
>guidance system."

>Note software DESIGN errors - not programming errors.

>Best Regards
>John McCabe <jo...@assen.demon.co.uk>

---If you read the Report, you'll see that that's not the case.
This is what the report says:


"* The internal SRI software exception was caused during execution of a
data conversion from 64-bit floating point to 16-bit signed integer
value. The floating point number which was converted had a value
greater than what could be represented by a 16-bit signed integer.
This resulted in an Operand Error. The data conversion instructions
(in Ada code) were not protected from causing an Operand Error,
although other conversions of comparable variables in the same place
in the code were protected.

"In the failure scenario, the primary technical causes are the Operand Error
when converting the horizontal bias variable BH, and the lack of protection
of this conversion which caused the SRI computer to stop."

---As you can see, it's clearly a programming error. It's a failure
to check for overflow on converting a double precision value to
a 16-bit integer.

Michel OLAGNON

Oct 4, 1996

But if you read a bit further on, it is stated that

The reason why three conversions, including the horizontal bias variable one,
were not protected, is that it was decided that they were physically bounded
or had a wide safety margin (...) The decision was a joint one of the project
partners at various contractual levels.

Deciding at various contractual levels is not what one usually means by
``programming''. It looks closer to ``design'', IMHO. But, of course, anyone
can give any word any meaning.
And it might be probable that the action taken in case of a protected
conversion, and an exception, would also have been to stop the SRI computer,
because such a high horizontal bias would have meant that it was broken....

Michel

--
| Michel OLAGNON email : Michel....@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|


Steve Bell

Oct 4, 1996

Michael Dworetsky wrote:
>
> >Just a point for your information. From clari.tw.space:
> >
> > "An inquiry board investigating the explosion concluded in
> >July that the failure was caused by software design errors in a
> >guidance system."
> >
> >Note software DESIGN errors - not programming errors.
> >
>
> Indeed, the problems were in the specifications given to the programmers,
> not in the coding activity itself. They wrote exactly what they were
> asked to write, as far as I could see from reading the report summary.
>
> The problem was caused by using software developed for Ariane 4's flight
> characteristics, which were different from those of Ariane 5. When the
> launch vehicle exceeded the boundary parameters of the Ariane-4 software,
> it send an error message and, as specified by the remit given to
> programmers, a critical guidance system shut down in mid-flight. Ka-boom.
>

I work for an aerospace company, and we received a fairly detailed accounting of what
went wrong with the Ariane 5. Launch vehicles, while they are sitting on the launch
pad, run a guidance program that updates their position and velocity in reference to
a coordinate frame whose origin is at the center of the earth (usually called an
Earth-Centered-Inertial (ECI) frame). This program is usually started up from 1 to 3-4
hours before launch and is allowed to run all the way until liftoff, so that the
rocket will know where it's at and how fast it's going at liftoff. Although called
"ground software" (because it runs while the rocket is on the ground), it resides
inside the rocket's guidance computer(s), and for the Titan family of launch vehicles,
the code is exited at t=0 (liftoff). This code is designed with the knowledge that the
rocket is rotating on the surface of the earth, and the algorithms expect only very
mild accelerations (as compared to when the rocket hauls ass off the pad at liftoff).
Well, the French do things a little differently (but probably now they don't). The
Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs
past liftoff. They do (did) this in case there are any unanticipated holds in the
countdown right close to liftoff. In this way, this position and velocity updating
code would *not* have to be reset if they could get off the ground within just a few
seconds of nominal. Well, it appears that the Ariane 5 really hauls ass off the pad,
because at about 30 secs, it was pulling some accelerations that caused floating point
overflows in the still functioning ground software. The actual flight software (which
was also running, naturally) was computing the positions and velocities that were
being used to actually fly the rocket, and it was doing just fine - no overflow errors
there because it was designed to expect high accelerations. There are two flight
computers on the Ariane 5 - a primary and a backup - and each was designed to shut
down if an error such as a floating point overflow occurred, thinking that the other
one would take over. Both computers were running the ground software, and both
experienced the floating point errors. Actually, the primary went belly-up first, and
then the backup within a fraction of a second later. With no functioning guidance
computer on board, well, ka-boom as you say.

Apparently the Ariane 4 gets off the ground with smaller accelerations than the 5, and
this never happened with a 4. You might take note that this would never happen with a
Titan because we don't execute this ground software after liftoff. Even if we did, we
would have caught the floating point overflows way before launch because we run all
code in what's called "Real-Time Simulations" where actual flight hardware and software
are subjected to any and all known physical conditions. This was another finding of
the investigation board - apparently the French don't do enough of this type of
testing because it's real expensive. Oh well, they probably do now!

--
Clear skies,
Steve Bell
sb...@delphi.com
http://people.delphi.com/sb635 - Astrophoto page

Joseph C Williams

Oct 4, 1996

Why didn't they run the code against an Ariane 5 simulator to
reverify the Ariane 4 software that was reused? A good real-time
engineering simulation would have caught the problem.

Wayne Hayes

Oct 6, 1996

In article <32551A...@gsde.hso.link.com>,

Joseph C Williams <u6...@gsde.hso.link.com> wrote:
>Why didn't they run the code against an Ariane 5 simulator to
>reverify the Ariane 4 software that was reused?

Money. (The more cynical among us may say this translates to "stupidity".)

--
"Unix is simple and coherent, but it takes || Wayne Hayes, wa...@cs.utoronto.ca
a genius (or at any rate, a programmer) to || Astrophysics & Computer Science
appreciate its simplicity." -Dennis Ritchie|| http://www.cs.utoronto.ca/~wayne

Ken Garlington

Oct 7, 1996

Steve Bell wrote:

> Well, the French do things a little differently (but probably now they don't). The
> Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs
> past liftoff. They do (did) this in case there are any unanticipated holds in the
> countdown right close to liftoff. In this way, this position and velocity updating
> code would *not* have to be reset if they could get off the ground within just a few
> seconds of nominal.

But why 40 seconds? Why not 1 second (or one millisecond, for that matter)?

> You might take note that this would never happen with a
> Titan because we don't execute this ground software after liftoff. Even if we did, we
> would have caught the floating point overflows way before launch because we run all
> code in what's called "Real-Time Simulations" where actual flight harware and software
> are subjected to any and all known physical conditions. This was another finding of
> the investigation board - apparently the French don't do enough of this type of
> testing because it's real expensive.

Going way back into my history, I believe this is also true for Atlas.

> --
> Clear skies,
> Steve Bell
> sb...@delphi.com
> http://people.delphi.com/sb635 - Astrophoto page

--

LMTAS - "Our Brand Means Quality"

For more info, see http://www.lmtas.com or http://www.lmco.com

@@ robin

Oct 9, 1996

mola...@ifremer.fr (Michel OLAGNON) writes:

>In article <532k32$r...@goanna.cs.rmit.edu.au>, r...@goanna.cs.rmit.edu.au (@@ robin) writes:
>> jo...@assen.demon.co.uk (John McCabe) writes:
>>

>> >Just a point for your information. From clari.tw.space:
>>
>> > "An inquiry board investigating the explosion concluded in
>> >July that the failure was caused by software design errors in a
>> >guidance system."
>>
>> >Note software DESIGN errors - not programming errors.
>>

>| Michel OLAGNON email : Michel....@ifremer.fr|

But if you read further on ....

"However, three of the variables were left unprotected. No reference to
justification of this decision was found directly in the source code. Given
the large amount of documentation associated with any industrial
application, the assumption, although agreed, was essentially obscured,
though not deliberately, from any external review."

.... you'll see that there was no documentation in the code to
explain why these particular 3 (dangerous) conversions were
left unprotected. There is the implication that one or more
of them might have been overlooked . . . Don't place
too much reliance on the conclusion of the report, when
the detail is right there in the body of the report.

@@ robin

Oct 9, 1996

Steve Bell <sb...@delphi.com> writes:

>Michael Dworetsky wrote:
>>
>> >Just a point for your information. From clari.tw.space:
>> >
>> > "An inquiry board investigating the explosion concluded in
>> >July that the failure was caused by software design errors in a
>> >guidance system."
>> >
>> >Note software DESIGN errors - not programming errors.
>> >
>>

>> Indeed, the problems were in the specifications given to the programmers,
>> not in the coding activity itself. They wrote exactly what they were
>> asked to write, as far as I could see from reading the report summary.
>>
>> The problem was caused by using software developed for Ariane 4's flight
>> characteristics, which were different from those of Ariane 5. When the
>> launch vehicle exceeded the boundary parameters of the Ariane-4 software,
>> it send an error message and, as specified by the remit given to
>> programmers, a critical guidance system shut down in mid-flight. Ka-boom.
>>

>I work for an aerospace company, and we received a fairly detailed accounting of what
>went wrong with the Ariane 5. Launch vehicles, while they are sitting on the launch
>pad, run a guidance program that updates their position and velocity in reference to
>a coordinate frame whose origin is at the center of the earth (usually called an
>Earth-Centered-Inertial (ECI) frame). This program is usually started up from 1 to 3-4
>hours before launch and is allowed to run all the way until liftoff, so that the
>rocket will know where it's at and how fast it's going at liftoff. Although called
>"ground software" (because it runs while the rocket is on the ground), it resides
>inside the rocket's guidance computer(s), and for the Titan family of launch vehicles,
>the code is exited at t=0 (liftoff). This code is designed with the knowledge that the
>rocket is rotating on the surface of the earth, and the algorithms expect only very
>mild accelerations (as compared to when the rocket hauls ass off the pad at liftoff).

>Well, the French do things a little differently (but probably now they don't). The
>Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs
>past liftoff. They do (did) this in case there are any unanticipated holds in the
>countdown right close to liftoff. In this way, this position and velocity updating
>code would *not* have to be reset if they could get off the ground within just a few

>seconds of nominal. Well, it appears that the Ariane 5 really hauls ass off the pad,
>because at about 30 secs, it was pulling some accelerations that caused floating point
>overflows

---Definitely not. No floating-point overflow occurred. In
Ariane 5, the overflow occurred on converting a double-precision
(some 56 bits?) floating-point to a 16-bit integer (15
significant bits).

That's why it was so important to have a check that the
conversion couldn't overflow!


>in the still functioning ground software. The actual flight software (which
>was also running, naturally) was computing the positions and velocities that were
>being used to actually fly the rocket, and it was doing just fine - no overflow errors
>there because it was designed to expect high accelerations. There are two flight
>computers on the Ariane 5 - a primary and a backup - and each was designed to shut
>down if an error such as a floating point overflow occurred,

---Again, not at all. It was designed to shut down if any interrupt
occurred. It wasn't intended to be shut down for such a routine thing as
a conversion of floating-point to integer.

>thinking that the other
>one would take over. Both computers were running the ground software, and both
>experienced the floating point errors.


---No, the backup SRI experienced the programming error (UNCHECKED
CONVERSION from floating-point to integer) first, and shut itself
down, then the active SRI computer experienced the same programming
error, then it shut itself down.

Steve O'Neill

Oct 9, 1996

@@ robin wrote:
> ---Definitely not. No floating-point overflow occurred. In
> Ariane 5, the overflow occurred on converting a double-precision
> (some 56 bits?) floating-point to a 16-bit integer (15
> significant bits).
>
> That's why it was so important to have a check that the
> conversion couldn't overflow!

Agreed. Yes, the basic reason for the destruction of a billion dollar
vehicle was for want of a couple of lines of code. But it reflects a
systemic problem much more damaging than what language was used.

I would have expected that in a mission/safety critical application
the proper checks would have been implemented, no matter what. And in a
'belts-and-suspenders' mode I would also expect an exception handler to
take care of unforeseen possibilities at the lowest possible level and
raise things to a higher level only when absolutely necessary. Had these
precautions been taken there would probably be lots of entries in an
error log but the satellites would now be orbiting.

As outsiders we can only second guess as to why this approach was not
taken but the review board implies that 1) the SRI software developers
had an 80% max utilization requirement and 2) careful consideration
(including faulty assumptions) was used in deciding what to protect and
not protect.

>It was designed to shut down if any interrupt occurred. It wasn't
                                     ^^^^^^^^^ exception, actually


>intended to be shut down for a routine thing as a conversion of
>floating-point to integer.

This was based on the (faulty) system-wide assumption that any exception
was the result of a random hardware failure. This is related to the
other faulty assumption that "software should be considered correct until
it is proven to be at fault". But that's what the specification said.

> ---No, the backup SRI experienced the programming error (UNCHECKED
> CONVERSION from floating-point to integer) first, and shut itself
> down, then the active SRI computer experienced the same programming
> error, then it shut itself down.

Yes, according to the report the backup died first (by 0.05 seconds).
Probably not as a result of an Unchecked_Conversion, though - the source
and target types are of different sizes, which would not be allowed. Most
likely it was just a conversion of a float to a 16-bit integer. This
would have raised a Constraint_Error (or Operand_Error in this
environment). This error could have been handled within the context of
this procedure (and the mission continued) but obviously was not.
Instead it appears to have been propagated to a global exception handler
which performed the specified actions admirably. Unfortunately these
included committing suicide and, in doing so, dooming the mission.
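
The distinction matters because the two conversions behave quite
differently at run time. A sketch of the checked case (names invented;
Operand_Error was particular to that environment, so the predefined
Constraint_Error is shown instead):

   with Ada.Text_IO;
   procedure Conversion_Kinds is
      type Int16 is range -32_768 .. 32_767;
      X : constant Long_Float := 40_000.0;   -- outside Int16's range
      Y : Int16;
      --  An instance of Ada.Unchecked_Conversion, by contrast, performs
      --  no run-time check at all; it merely reinterprets bits, and an
      --  instance between types of different sizes is the sort of thing
      --  an implementation may refuse outright.
   begin
      Y := Int16 (X);  -- checked conversion: raises Constraint_Error here
      Ada.Text_IO.Put_Line (Int16'Image (Y));
   exception
      when Constraint_Error =>
         Ada.Text_IO.Put_Line ("Constraint_Error: value out of range");
   end Conversion_Kinds;

Handled in the frame where it happens, the exception is a log entry;
propagated to a last-chance handler that shuts the processor down, it
is a lost mission.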

--
Steve O'Neill | "No,no,no, don't tug on that!
Sanders, A Lockheed Martin Company | You never know what it might
smon...@sanders.lockheed.com | be attached to."
(603) 885-8774 fax: (603) 885-4071| Buckaroo Banzai

Alan Brain

unread,
Oct 12, 1996, 3:00:00 AM10/12/96
to

Steve O'Neill wrote:

> I would have expected that in a mission/safety critical application
> the proper checks would have been implemented, no matter what. And in a
> 'belts-and-suspenders' mode I would also expect an exception handler to
> take care of unforeseen possibilities at the lowest possible level and
> raise things to a higher level only when absolutely necessary. Had these
> precautions been taken there would probably be lots of entries in an
> error log but the satellites would now be orbiting.

Concur completely. This should be Standard Operating Procedure, a matter
of habit. Frankly, it's just good engineering practice. But it is honoured
more in the breach than in the observance, it seems, because....



> As outsiders we can only second guess as to why this approach was not
> taken but the review board implies that 1) the SRI software developers
> had an 80% max utilization requirement and 2) careful consideration
> (including faulty assumptions) was used in deciding what to protect and
> not protect.

... as some very reputable people, working for very reputable firms, have
tried to pound into my thick skull, they are used to working with
tolerances of 15%, no more. And with diamond-grade Hard Real Time slices,
where any over-run, no matter how slight, means disaster. In this case,
Formal Proof and strict attention to the number of CPU cycles in all
possible paths seems the only way to go.
But this leaves you so open to error in all but the simplest, most
trivial tasks (just the race analysis would be nightmarish) that these
slices had better be a very small part of the task, or the task itself
must be very simple indeed. Either way, it doesn't have much bearing on
the vast majority of problems I've encountered.
If the tasks are not simple... then can I please ask the firms concerned
to tell me which aircraft their software is on, so I can take appropriate
action?

Ralf Tilch

unread,
Oct 17, 1996, 3:00:00 AM10/17/96
to


Hello,

I followed the discussion of the ARIANE 5 failure.
I didn't read all the mails, and I am quite astonished
how far, and in how much detail, it can be discussed.
Like:
which programming language would have been best,
.....

It's good to know what happened.
But I think the more important point is this:
you build something new (very complex).
You invest some billions to develop it.
You build it (an ARIANE 5, carrying several satellites).
Its price is several hundred millions,
and yet you don't check it as much as possible,
don't make a 'very complete check',
especially of the software.

The reason that the software wasn't checked:
It was too 'expensive'?!?!

They forgot Murphy's law, which always 'works'.


I think you can't design a new car without
testing it completely.

Say we test 95% of the construction, and six months after
the new car goes on sale a wheel falls off at 160 km/h.
OK, there was a small problem in the construction software:
some wrong values, due to some over- or underflows or
whatever.

The result: the company will probably have to pay quite a
lot, and probably have to close!

--------------------------------------------------------
-DON'T TRUST YOURSELF, TRUST MURPHY'S LAW !!!!

"If anything can go wrong, it will."

--------------------------------------------------------
With this, have fun and continue the discussion about
conversions from 64-bit to 16-bit values, etc.

RT


________________|_______________________________________|_
| E-mail : R.T...@gmd.de |
| Tel. : (+49) (0)2241/14-23.69 |
________________|_______________________________________|_
| |

Ravi Sundaram

unread,
Oct 17, 1996, 3:00:00 AM10/17/96
to

Ralf Tilch wrote:
> The reason that the software wasn't checked:
> It was too 'expensive'?!?!.

Yeah, isn't hindsight a wonderful thing?
They, whoever were in charge of these decisions,
knew too that testing is important. But it is impossible
to test every subcomponent under every possible
condition. There is simply not enough money or time
available to do that.

Take the space shuttle, for example. The total computing
power available on board is probably as much as is used
in a Nintendo Game Boy. The design was frozen in the 1970s.
Upgrading the computers and software would be so expensive
to test and prove that they approach it with much trepidation.

Richard Feynman examined the practices of NASA and
found that the workers who assembled some large bulkheads
had to count bolts from two reference points. He thought
providing four reference points would simplify the job.
NASA rejected the proposal because it would involve
too many changes to the documentation, procedures and
testing. (Surely You're Joking, Mr. Feynman!, or was it
the sequel?)

So praise them for conducting a no-nonsense investigation
and owning up to the mistakes. Learn to live with
failed space shots. They will become as reliable as
air travel once we have launched about 10 million rockets.

--
Ravi Sundaram.
10/17/96
PS: I am out of here. Going on vacation. Won't read followups
for a month.
(Opinions are mine, not Ansoft's.)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

shm...@os2bbs.com

unread,
Oct 22, 1996, 3:00:00 AM10/22/96
to

In <326674...@ansoft.com>, Ravi Sundaram <ra...@ansoft.com> writes:
>Ralf Tilch wrote:
>> The reason that the software wasn't checked:
>> It was too 'expensive'?!?!.
>
> Yeah, isn't hindsight a wonderful thing?
> They, whoever were in charge of these decisions,
> knew too that testing is important. But it is impossible
> to test every subcomponent under every possible
> condition. There is simply not enough money or time
> available to do that.

Why do you assume that it was hindsight? They violated fundamental
software engineering principles, and anyone who has been in this business
for long should have expected the chickens to come home to roost, even if
they couldn't predict what would go wrong first.

> Richard Feynman examined the practices of NASA and
> found that the workers who assembled some large bulkheads
> had to count bolts from two reference points. He thought
> providing four reference points would simplify the job.
> NASA rejected the proposal because it would involve
> too many changes to the documentation, procedures and
> testing. (Surely You're Joking, Mr. Feynman!, or was it
> the sequel?)
>
> So praise them for conducting a no-nonsense investigation
> and owning up to the mistakes. Learn to live with
> failed space shots. They will become as reliable as
> air travel once we have launched about 10 million rockets.

I hope that you're talking about Ariane and not NASA's Challenger; Feynman's
account of the behavior of most of the Rogers Commission, in "What Do
You Care ...", sounds more like a failed coverup than like "owning up to
their mistakes", and Feynman had to threaten to air a dissenting opinion
on television before they agreed to publish it in their report.

Shmuel (Seymour J.) Metz
Atid/2


Jim Carr

unread,
Oct 22, 1996, 3:00:00 AM10/22/96
to

shmue...@os2bbs.com writes:
>
>I hope that you're talking about Ariane and not NASA Challenger; Feynman's
>account of the behavior of most of the Rogers Commission, in "Why Do
>You Care ..." sounds more like a failed coverup than like "owning up to
>their mistakes", ...

The coverup was not entirely unsuccessful. Feynman did manage to break
through and get his dissenting remarks on NASA reliability estimates
into the report (as well as into Physics Today), but the coverup did
succeed in keeping most people ignorant of the fact that the astronauts
did not die until impact with the ocean despite a Miami Herald story
pointing that out to its mostly-regional audience.

Did you ever see a picture of the crew compartment?

--
James A. Carr <j...@scri.fsu.edu> | Raw data, like raw sewage, needs
http://www.scri.fsu.edu/~jac | some processing before it can be
Supercomputer Computations Res. Inst. | spread around. The opposite is
Florida State, Tallahassee FL 32306 | true of theories. -- JAC

hayim

unread,
Oct 24, 1996, 3:00:00 AM10/24/96
to

Unfortunately, I missed the original article describing the Ariane failure.
If someone could please either point me in the right direction as to where
I can get a copy, or even send it to me, I would greatly appreciate it.

Thanks very much,

Hayim Hendeles

E-mail: ha...@platsol.com


Michel OLAGNON

unread,
Oct 25, 1996, 3:00:00 AM10/25/96
to

In article <54oht1$l...@orchard.la.platsol.com>, <hayim> writes:
>Unfortunately, I missed the original article describing the Ariane failure.
>If someone could please either point me in the right direction as to where
>I can get a copy, or even send it to me, I would greatly appreciate it.
>

It may be useful to repeat the source address for the full report, since
many comments seem to be based only on a presentation summary:

http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

Ken Garlington

unread,
Oct 25, 1996, 3:00:00 AM10/25/96
to

hayim wrote:
>
> Unfortunately, I missed the original article describing the Ariane failure.
> If someone could please either point me in the right direction as to where
> I can get a copy, or even send it to me, I would greatly appreciate it.
>
> Thanks very much,
>
> Hayim Hendeles
>
> E-mail: ha...@platsol.com

See:
http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
