
Ariane 5 failure


@@ robin

Sep 25, 1996

agr...@netcom.com (Amara Graps) writes:

>I read the following message from my co-workers that I thought was
>interesting. So I'm forwarding it to here.

>(begin quote)
>Ariane 5 failure was attributed to a faulty DOUBLE -> INT conversion
>(as the proximate cause) in some ADA code in the inertial guidance
>system. Diagnostic error messages from the (faulty) inertial guidance
>system software were interpreted by the steering system as valid data.

>English text of the inquiry board's findings is at
> http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
>(end quote)

>Amara Graps email: agr...@netcom.com
>Computational Physics vita: finger agr...@best.com

There's a little more to it . . .

The unchecked data conversion in the Ada program resulted
in the shutdown of the computer. The backup computer had
already shut down a whisker of a second before. Consequently,
the on-board computer was unable to switch to the backup, and
used the error codes from the shutdown computer as
flight data.
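
For concreteness, here is a minimal Ada sketch of the kind of
conversion at issue -- the names and ranges are made up, not the
actual SRI code:

   procedure Bias_Demo is
      type Horizontal_Bias is range -32768 .. 32767;  -- 16-bit target
      Velocity_Term : Float := 40_000.0; -- grows with horizontal speed
      BH            : Horizontal_Bias;
   begin
      -- With range checks enabled, this conversion raises
      -- Constraint_Error as soon as Velocity_Term leaves the 16-bit
      -- range, and the handler below runs.  The Ariane code reportedly
      -- left the equivalent conversion unprotected.
      BH := Horizontal_Bias (Velocity_Term);
   exception
      when Constraint_Error =>
         null;  -- report the fault and continue with a safe value
   end Bias_Demo;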

This is not the first time that such a programming error
(integer out of range) has occurred.

In 1981, the manned STS-2 was preparing to take off, but because
some fuel was accidentally spilt and some tiles were accidentally
dislodged, takeoff was delayed by a month.

During that time, the astronauts decided to get in some
more practice with the simulator.

During a simulated descent, the 4 computing systems (the main
and the 3 backups) got stuck in a loop, with the complete
loss of control.

The cause? An integer out of range -- the same kind of
problem as with Ariane 5.

In the STS-2 case, the precise cause was a computed GOTO
with a bad index (similar to a CASE statement without
an OTHERWISE clause).

In both cases, the programming error could have been detected
with a simple test, but in both cases, no test was included.

One would have thought that, having had at least one failure
from an integer going out of range, the implementors of the software
for Ariane 5 would have been extra careful in ensuring that
all data conversions were within range -- since any kind
of interrupt would result in destruction of the spacecraft.

There's a case for a review of the programming language used.

Michel OLAGNON

Sep 25, 1996

In article <52a572$9...@goanna.cs.rmit.edu.au>, r...@goanna.cs.rmit.edu.au (@@ robin) writes:
>[reports of Ariane and STS-2 bugs deleted]

>
>
>In both cases, the programing error could have been detected
>with a simple test, but in both cases, no test was included.
>
>One would have thought that having had one failure (at least)
>for integer out-of-range, that the implementors of the software
>for Ariane 5 would have been extra careful in ensuring that
>all data conversions were within range -- since any kind
>of interrupt would result in destruction of the spacecraft.
>

Maybe the main reason for the lack of testing and care was
that the conversion exception could only occur after lift off,
and that that particular piece of program was of no use after
lift off. It was only kept running for 50 s in order to
speed up countdown restart in case of an interruption between
H0-9 and H0-5.

Conclusion: Never compute values that are of no use when you can
avoid it !

>There's a case for a review of the programming language used.


Michel
--
| Michel OLAGNON email : Michel....@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|


Byron Kauffman

Sep 25, 1996

Michel OLAGNON wrote:
>
> May be the main reason for the lack of testing and care was
> that the conversion exception could only occur after lift off,
> and that that particular piece of program was of no use after
> lift off. It was only kept running for 50 s in order to
> speed up countdown restart in case of an interruption between
> H0-9 and H0-5.
>
> Conclusion: Never compute values that are of no use when you can
> avoid it !
>
> >There's a case for a review of the programming language used.
>
> Michel
> --
> | Michel OLAGNON email : Michel....@ifremer.fr|
> | IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|

Of course, Michel, you've got a great point, but let me give you some
advice, assuming you haven't read this thread for the last few months
(seems like years). Robin's whole point is that he firmly believes
that the problem would not have occurred if PL/I had been used instead
of Ada. Several EXTREMELY competent and experienced engineers who
actually have written flight-control software have patiently, and in
some cases (though I can't blame them) impatiently attempted to
explain the situation - that this was a bad design/management decision
combined with a fatal oversight in testing - to this poor student, but
alas, to no avail.

My advice, Michel - blow it off and don't let ++robin (or is it
@@robin?) get to you, because "++robin" is actually an alias for John
Cleese. He's gathering material for a sequel to "The Argument
Sketch"... :-)

A. Grant

Sep 25, 1996

In article <32492E...@lmtas.lmco.com> Byron Kauffman <Kauff...@lmtas.lmco.com> writes:
>Several EXTREMELY competent and experienced engineers who actually have
>written flight-control software have patiently, and in some cases
>(though I can't blame them) impatiently attempted to explain the
>situation - that this was a bad design/management decision combined with
>a fatal oversight in testing - to this poor student, but alas, to no
>avail.

Robin is not a student. He is a senior lecturer at the Royal
Melbourne Institute of Technology, a highly reputable institution.

Bob Kitzberger

Sep 25, 1996

@@ robin (r...@goanna.cs.rmit.edu.au) wrote:
: The cause? An integer out of range -- the same problem

: as with Ariane 5, where an integer became out of range.
...
: There's a case for a review of the programming language used.

Why do you persist?

Ada _has_ range checks built into the language. They were explicitly
disabled in this case.
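
For illustration, a minimal sketch of that mechanism -- made-up names,
not the Ariane source:

   procedure Update_Bias (Input : in Float) is
      pragma Suppress (Range_Check);
      pragma Suppress (Overflow_Check);
      type Bias_Type is range -32768 .. 32767;
      Bias : Bias_Type;
   begin
      -- With the two pragmas above, no Constraint_Error is raised
      -- here; an out-of-range Input silently yields a bad value.
      -- Delete the pragmas and the same conversion raises
      -- Constraint_Error, which a caller can handle.
      Bias := Bias_Type (Input);
   end Update_Bias;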

What are you failing to grasp?

--
Bob Kitzberger Rational Software Corporation r...@rational.com
http://www.rational.com http://www.rational.com/pst/products/testmate.html

Chris Morgan

Sep 25, 1996

In article <ag129.804...@ucs.cam.ac.uk> ag...@ucs.cam.ac.uk
(A. Grant) writes:

Robin is not a student. He is a senior lecturer at the Royal
Melbourne Institute of Technology, a highly reputable institution.

I'm tempted to say "not so reputable to readers of this newsgroup"
after the ridiculous statements made by Robin w.r.t. Ariane 5 but
Richard A. O'Keefe's regular excellent postings more than balance them
out.

Chris
--
Chris Morgan |email c...@mihalis.demon.co.uk (home)
http://www.mihalis.demon.co.uk/ | or chris....@baesema.co.uk (work)

Ken Garlington

Sep 25, 1996

A. Grant wrote:
> Robin is not a student. He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

When it comes to building embedded safety-critical systems, trust me:
He's a student!

--
LMTAS - "Our Brand Means Quality"

Ronald Kunne

Sep 26, 1996

In article <52bm1c$g...@rational.rational.com>

r...@rational.com (Bob Kitzberger) writes:

>Ada _has_ range checks built into the language. They were explicitly
>disabled in this case.

The problem of constructing bug-free real-time software seems to me
a trade-off between safety and speed of execution (and maybe available
memory?). In other words: including tests on array boundaries might
make the code safer, but also slower.

Comments?

Greetings,
Ronald

Byron Kauffman

Sep 26, 1996

A. Grant wrote:
>
> In article <32492E...@lmtas.lmco.com> Byron Kauffman <Kauff...@lmtas.lmco.com> writes:
> >Several EXTREMELY competent and experienced engineers who actually have
> >written flight-control software have patiently, and in some cases
> >(though I can't blame them) impatiently attempted to explain the
> >situation - that this was a bad design/management decision combined with
> >a fatal oversight in testing - to this poor student, but alas, to no
> >avail.
>
> Robin is not a student. He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

A. -

Thank you for confirming my long-held theory that those who inhabit
the ivory towers of engineering/CS academia should spend 2 of every 5
years working at a real job out in the real world. My intent is not to
slam professors who are in touch with reality, of course (e.g.,
Feldman, Dewar, et al), but the idealistic theoretical side often is a
far cry from the practical, just-get-it-done world we have to deal
with once we're out of school.


I just KNOW there's a good Dilbert strip here somewhere...

Sandy McPherson

Sep 26, 1996

A. Grant wrote:
>
> Robin is not a student. He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

Why doesn't he wise up and act like one then?

I don't know the man, and I suspect he has been winding everybody up
just for a laugh. But, if this is not the case, the thought of such a
closed mind teaching students is quite horrific.

"Use PL/I mate, you'll be tucker",

--
Sandy McPherson MBCS CEng. tel: +31 71 565 4288 (w)
ESTEC/WAS
P.O. Box 299
NL-2200AG Noordwijk

Matthew Heaney

Sep 26, 1996

In article <1780E84...@frcpn11.in2p3.fr>, KU...@frcpn11.in2p3.fr
(Ronald Kunne) wrote:

Why, yes. If the rocket blows up, at the cost of millions of dollars, then
I'm not clear what the value of "faster execution" is. The rocket's gone,
so what difference does it make how fast the code executed? If you left
the range checks in, your code would be *marginally* slower, but you'd
still have your rocket, now wouldn't you?

>Ronald

Matt

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mhe...@ni.net
(818) 985-1271

Wayne Hayes

Sep 27, 1996

In article <mheaney-ya0231800...@news.ni.net>,

Matthew Heaney <mhe...@ni.net> wrote:
>Why, yes. If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is. The rocket's gone,
>so what difference does it make how fast the code executed? If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?

The point is moot, though. In this case, catching the error wouldn't
have helped. The out-of-bounds error happened in a piece of code
designed for the Ariane-4, in which it was *physically impossible* for
the value to overflow (the Ariane-4 didn't go that fast, and it was a
velocity variable). Then the code was used, as-is, in the Ariane-5,
without an analysis of how the code would behave on the new vehicle,
which flew faster. Had the analysis been done, they wouldn't have
added bounds checking; they would have modified the code to actually
*work*, because they would have realized that the code was
*guaranteed* to fail on the first flight.

--
"And a woman needs a man... || Wayne Hayes, wa...@cs.utoronto.ca
like a fish needs a bicycle..." || Astrophysics & Computer Science
-- U2 (apparently quoting Gloria Steinem?) || http://www.cs.utoronto.ca/~wayne

Alan Brain

Sep 27, 1996

Ronald Kunne wrote:

> The problem of constructing bug-free real-time software seems to me
> a trade-off between safety and speed of execution (and maybe available
> memory?). In other words: including tests on array boundaries might
> make the code safer, but also slower.
>
> Comments?

Bug-free software is not a reasonable criterion for success in a
safety-critical system, IMHO. A good program should meet the
requirements for safety etc. despite bugs. Also despite hardware
failures, soft failures, and so on. A really good safety-critical
program should be remarkably difficult to debug, as the only way you
know it's got a major problem is by examining the error log, and
calculating that its performance is below theoretical expectations.

And if it runs too slowly, many times in the real world you can spend
2 years of development time and many megabucks kludging the software,
or wait 12 months and get the new 400 MHz chip instead of your current
133.

---------------------- <> <> How doth the little Crocodile
| Alan & Carmel Brain| xxxxx Improve his shining tail?
| Canberra Australia | xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo oo oo oo
By pulling Maerklin Wagons, in 1/220 Scale

Ronald Kunne

Sep 27, 1996

In article <mheaney-ya0231800...@news.ni.net>

mhe...@ni.net (Matthew Heaney) writes:

>>The problem of constructing bug-free real-time software seems to me
>>a trade-off between safety and speed of execution (and maybe available
>>memory?). In other words: including tests on array boundaries might
>>make the code safer, but also slower.

>Why, yes. If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is. The rocket's gone,
>so what difference does it make how fast the code executed? If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?

Despite the sarcasm, I will elaborate.

Suppose an array goes from 0 to 100, and the calculated index is known
not to go outside this range. Why would one insist on putting the
range test in, which will slow down the code? This might be a problem
if the particular piece of code is heavily used, and the code executes
too slowly otherwise. "Marginally slower" if it happens only once, but
such checks on indices and function arguments (like square roots) are
necessary *everywhere* in code, if one is consistent.

Actually, this was the case here: the code was taken from Ariane 4
code where it was physically impossible for the index to go out of
range: a test would have been a waste of time. Unfortunately, this
was no longer the case in the Ariane 5.

Friendly greetings,
Ronald Kunne

A. Grant

Sep 27, 1996

In article <324A7C...@lmtas.lmco.com> Byron Kauffman <Kauff...@lmtas.lmco.com> writes:
>A. Grant wrote:
>> Robin is not a student. He is a senior lecturer at the Royal
>> Melbourne Institute of Technology, a highly reputable institution.

>Thank you for confirming my long-held theory that those who inhabit the


>ivory towers of engineering/CS academia should spend 2 of every 5 years
>working at a real job out in the real world. My intent is not to slam
>professors who are in touch with reality, of course (e.g., Feldman,
>Dewar, et al), but the idealistic theoretical side often is a far cry
>from the practical, just-get-it-done world we have to deal with once
>we're out of school.

You're being a bit hard on theoretical computer scientists here.
Just because it's called computer science doesn't mean it has to be
able to instantly make money on real computers. And the Ariane 5
failure was due to pragmatism (reusing old stuff to save money)
not idealism (applying theoretical proofs of correctness).

But in any case RMIT is noted for its involvement with industry.
(I used to work for a start-up company out of RMIT premises.)
If PL/I is being pushed by RMIT it's probably because the DP
managers in Collins St. want it. Australia doesn't have much call
for aerospace systems.

Ken Garlington

Sep 27, 1996

Ronald Kunne wrote:
>
> In article <52bm1c$g...@rational.rational.com>
> r...@rational.com (Bob Kitzberger) writes:
>
> >Ada _has_ range checks built into the language. They were explicitly
> >disabled in this case.
>
> The problem of constructing bug-free real-time software seems to me
> a trade-off between safety and speed of execution (and maybe available
> memory?). In other words: including tests on array boundaries might
> make the code safer, but also slower.

Particularly for fail-operate systems that must continue to function in
harsh environments, memory and throughput can be tight. This usually happens
because the system must continue to operate on emergency power and/or
cooling. At least until recently, the processing systems that had lots of
memory and CPU power also had larger power and cooling requirements, so they
couldn't always be used in this class of systems. (That's changing, somewhat.) So,
the tradeoff you describe can occur.

The trade-off I find even more interesting is the safety gained from
adding extra features vs. the safety _lost_ by adding those features. Every
time you add a check, whether it's an explicit check or one automatically
generated by the compiler, you have to have some way to gain confidence that
the check will not only work, but won't create some side-effect that causes
a different problem. The effort expended to get confidence for that additional
feature is effort that can't be spent gaining assurance of other features in
the system, assuming finite resources. There is no magic formula I've ever
seen to make that trade-off - ultimately, it's human judgement.

John McCabe

Sep 27, 1996

r...@goanna.cs.rmit.edu.au (@@ robin) wrote:

<..snip..>

Just a point for your information. From clari.tw.space:

"An inquiry board investigating the explosion concluded in
July that the failure was caused by software design errors in a
guidance system."

Note software DESIGN errors - not programming errors.

Best Regards
John McCabe <jo...@assen.demon.co.uk>


Lawrence Foard

Sep 27, 1996

Ronald Kunne wrote:
>
> Actually, this was the case here: the code was taken from an Ariane 4
> code where it was physically impossible that the index would go out
> of range: a test would have been a waste of time.
> Unfortunately this was no longer the case in the Ariane 5.

Actually it would still present a danger on Ariane 4. If the sensor
which apparently was no longer needed during flight became defective,
then you could get a value out of range.

--
The virgin birth of Pythagoras via Apollo. The martyrdom of
St. Socrates. The Gospel according to Iamblichus.
-- Have an 18.9cents/minute 6 second billed calling card tomorrow --
http://www.vwis.com/cards.html

Richard Pattis

Sep 27, 1996

As an instructor in CS1/CS2, this discussion interests me. I try to talk about
designing robust, reusable code, and actually have students reuse code that
I have written as well as some that they (and their peers) have written.
The Ariane failure adds a new view to robustness, having to do with future
use of code, and mathematical proof vs. "engineering" considerations.

Should a software engineer remove safety checks if he/she can prove - based on
physical limitations, like a rocket not exceeding a certain speed - that they
are unnecessary? Or, knowing that his/her code will be reused (in an unknown
context, by someone who is not so skilled, and will probably not think to
redo the proof), should such checks not be optimized out? What rule of thumb
should be used to decide (e.g., what if the proof assumes the rocket speed
will not exceed that of light)? Since software operates in the real world (not
the world of mathematics), should mathematical proofs about code always yield
to engineering rules of thumb to expect the unexpected?

"In the Russian theatre, every 5 years an unloaded gun accidentally
discharges and kills someone; every 20 years a broom does."

What is the rule of thumb about when should mathematics be believed?

As to saving SPEED by disabling the range checks: did the code not meet its
speed requirements with range checks on? Only in this case would I have turned
them off. Does "real time" mean fast enough or as fast as possible? To
misquote Einstein, "Code should run as fast as necessary, but no faster...."
since something is always traded away to increase speed.

If I were to try to create a lecture on this topic, what other similar
failures should I know about (besides the legendary Venus probe)?
Your comments?

Rich

Ken Garlington

Sep 28, 1996

Matthew Heaney wrote:
>

Ken Garlington

Sep 28, 1996

Ronald Kunne wrote:
>
> In article <mheaney-ya0231800...@news.ni.net>
> mhe...@ni.net (Matthew Heaney) writes:
>
> >>The problem of constructing bug-free real-time software seems to me
> >>a trade-off between safety and speed of execution (and maybe available
> >>memory?). In other words: including tests on array boundaries might
> >>make the code safer, but also slower.
>
> >Why, yes. If the rocket blows up, at the cost of millions of dollars, then
> >I'm not clear what the value of "faster execution" is. The rocket's gone,
> >so what difference does it make how fast the code executed? If you left
> >the range checks in, your code would be *marginally* slower, but you'd
> >still have your rocket, now wouldn't you?
>
> Despite the sarcasm, I will elaborate.
>
> Suppose an array goes from 0 to 100, and the calculated index is known
> not to go outside this range. Why would one insist on putting the
> range test in, which will slow down the code? This might be a problem
> if the particular piece of code is heavily used, and the code executes
> too slowly otherwise. "Marginally slower" if it happens only once, but
> such checks on indices and function arguments (like squareroots), are
> necessary *everywhere* in code, if one is consequent.

I might agree with the conclusion, but probably not with the argument.
If the array is statically typed to go from 0 to 100, and everything
that indexes it is statically typed for that range or smaller, most
modern Ada compilers won't generate _any_ code for the check.
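
For illustration, a minimal sketch -- made-up names, and whether the
check actually disappears is of course up to the particular compiler:

   procedure Clear_Table is
      subtype Index is Integer range 0 .. 100;
      type Table is array (Index) of Float;
      T : Table := (others => 1.0);
   begin
      for I in T'Range loop
         -- I can only hold values in Index, so the compiler can prove
         -- the index check always succeeds and emit no code for it.
         T (I) := 0.0;
      end loop;
   end Clear_Table;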

I still believe the more interesting issue has to do with the _consequences_
of the check. If your environment doesn't lend itself to a reasonable response
to the check (quite possible in fail-operate systems inside systems that move
really fast), and you have to test the checks to make sure they don't _create_
a problem, then you've got a hard decision on your hands: suppress the check
(which might trigger a compiler bug or some other problems), or leave the check in
(which might introduce a problem, or divert your attention away from some other
problem).

Ken Garlington

Sep 28, 1996

Alan Brain wrote:

>
> Ronald Kunne wrote:
>
> > The problem of constructing bug-free real-time software seems to me
> > a trade-off between safety and speed of execution (and maybe available
> > memory?). In other words: including tests on array boundaries might
> > make the code safer, but also slower.
> >
> > Comments?
>
> Bug-free software is not a reasonable criterion for success in a
> safety-critical system, IMHO. A good program should meet the
> requirements for safety etc despite bugs.

An OK statement for a fail-safe system. How do you propose to implement
this theory for a fail-operate system, particularly if there are system
constraints on weight, etc. that preclude hardware backups?

> Also despite hardware
> failures, soft failures, and so on.

A system which will always meet its requirements despite any combination
of failures is in the same regime as the perpetual motion system. If
you build one, you'll probably make a lot of money, so go to it!

> A really good safety-critical
> program should be remarkably difficult to de-bug, as the only way you
> know it's got a major problem is by examining the error log, and
> calculating that it's performance is below theoretical expectations.
> And if it runs too slow, many times in the real-world you can spend 2
> years of development time and many megabucks kludging the software, or
> wait 12 months and get the new 400 Mhz chip instead of your current 133.

I really need to change jobs. It sounds so much simpler to build
software for ground-based PCs, where you don't have to worry about the
weight, power requirements, heat dissipation, physical size,
vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.

Ken Garlington

Sep 28, 1996

From the "There's always time to test it the second time around"
department...

ORBITAL JUNK: The second Ariane 5 to be launched in April at the
earliest will put two dummy satellites, worth less than $3
million, into orbit. The first Ariane 5 exploded in June carrying
four uninsured satellites worth $500 million. (Financial Times)

I wonder if the test labs at Arianespace, etc. are keeping busy... :)

Dann Corbit

Sep 29, 1996

I propose a software IC metaphor for high
reliability projects. (And all eventually).

Currently, the software industry goes by
what I call a "software schematic" metaphor.
We put in components that are tested, but
we do not necessarily know the performance
curves.

If you look at S. Moshier's code in the
Cephes Library on Netlib, you will see that
he offers statistical evidence that his
programs are robust. So you can at least
infer, on a probability basis, what the odds
are of a component failing. So instead of
just dropping in a resistor or a transistor,
we read the little gold band, or the spec
on the transistor that shows what voltages
it can operate under.
For simple components with, say, five bytes
of input, we could exhaustively test all
possible inputs and outputs. For more
complicated procedures with many bytes of
inputs, we could perform probability testing,
and test other key values.

Imagine a database like the following:
TABLE: MODULES
int ModuleUniqueID
int ModuleCategory
char*60 ModuleName
char*255 ModuleDescription
text ModuleCode
text TestRoutineUsed
bit CompletelyTested

TABLE: TestResults (many result sets for one module)
int TestResultUniqueID
int ModuleUniqueID
char*60 OperatingSystem
char*60 CompilerUsed
binary ResultChart
text ResultDescription
float ProbabilityOfFailure
float RmsErrorObserved
float MaxErrorObserved

TABLE: KnownBugs (many known bugs for one module)
int KnownBugUniqueID
int ModuleUniqueID
char*60 KnownBugDescription
text BugDefinition
text PossibleWorkAround

Well, this is just a rough outline, but the value of
a database like this would be obvious. This could
easily be improved and expanded. (More domain tables,
tables for defs of parameters to the module, etc.)

If we had a tool like that, we would be using
software IC's, not software schematics.
--
"I speak for myself and all of the lawyers of the world"
If I say something dumb, then they will have to sue themselves.


Alan Brain

Sep 29, 1996

Ronald Kunne wrote:

> Suppose an array goes from 0 to 100, and the calculated index is known
> not to go outside this range. Why would one insist on putting the
> range test in, which will slow down the code? This might be a problem
> if the particular piece of code is heavily used, and the code executes
> too slowly otherwise. "Marginally slower" if it happens only once, but
> such checks on indices and function arguments (like squareroots), are
> necessary *everywhere* in code, if one is consequent.

Why insist?
1. Suppressing all checks in Ada-83 makes about a 5% difference in
execution speed, in typical real-time and avionics systems. (For
example, B2 simulator, CSU-90 sonar, COSYS-200 Combat system.) If your
hardware budget is this tight, you'd better not have lives at risk, or
a lot of money, as technical risk is appallingly high.

2. If you know the range is 0-100, and you get 101, what does this show?
a) A bug in the code (99.9999....% probable). b) A hardware fault. c) A
soft failure, as in a stray cosmic ray zapping a bit. d) a faulty
analysis of your "can't happen" situation. As in re-use, or where your
array comes from an IO channel with noise on....

Type a) and d) failures should be caught during testing. Most of them.
OK, some of them. Range checking here is a necessary debugging aid. But
type b) and c) can happen too out in the real world, and if you don't
test for an error early, you often can't recover the situation. Lives or
$ lost.

Brain's law:
"Software Bugs and Hardware Faults are no excuse for the Program not to
work".

So: it costs peanuts, and may save your hide.
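
A sketch of the "catch it early, log it, and carry on" style this
implies -- hypothetical names, and the recovery policy is obviously
application-specific:

   procedure Monitor is
      type Metres is range 0 .. 600;      -- the sensor's legal range
      Last_Good : Metres := 0;

      function Raw_Sample return Integer is
      begin
         return 250;  -- stand-in for an I/O channel read (may be noise)
      end Raw_Sample;

      function Depth return Metres is
      begin
         return Metres (Raw_Sample);      -- range check applies here
      exception
         when Constraint_Error =>
            -- bug, failed sensor or flipped bit: note it in the error
            -- log and fall back instead of dying
            return Last_Good;
      end Depth;
   begin
      Last_Good := Depth;
   end Monitor;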

Louis K. Scheffer

Sep 29, 1996

KU...@frcpn11.in2p3.fr (Ronald Kunne) writes:

>The problem of constructing bug-free real-time software seems to me
>a trade-off between safety and speed of execution (and maybe available
>memory?). In other words: including tests on array boundaries might
>make the code safer, but also slower.
>
>Comments?

True in this case, but not in the way you might expect. The software group
decided that they wanted the guidance computers to be no more than 80 percent
busy. Range checking ALL the variables took too much time, so they analyzed
the situation and only checked those that might overflow. In the Ariane 4,
this particular variable could not overflow unless the trajectory was wildly
off, so they left out the range checking.

I think you could make a good case for range checking in the Ariane
software making it less safe, rather than more safe. The only reason they
check for overflow is to find hardware errors - since the software is designed
to not overflow, then any overflow must be because of a hardware problem, so
if any processor detects an overflow it shuts down. So on the one hand, each
additional range check increases the odds of catching a hardware error before
it does damage, but increases the odds that a processor shuts down while it
could still be delivering useful data. (Say the overflow occurs while
computing unimportant results, as on the Ariane 5). Given the relative
odds of hardware and software errors, it's not at all obvious to me that
range checking helps at all in this case!

The real problem is that they did not re-examine this software for the
Ariane 5. If they had either simulated it or examined it closely, they
would probably have found this problem.
-Lou Scheffer

Alan Brain

Sep 29, 1996

Richard Pattis wrote:
>
> As an instructor in CS1/CS2, this discussion interests me. I try to talk about
> designing robust, reusable code.... --->8----

> The Ariane failure adds a new view to robustness, having to do with future
> use of code, and mathematical proof vs "engineering" considerations..
>
> Should a software engineer remove safety checks if he/she can prove - based on
> physical limitations, like a rocket not exceeding a certain speed - that they
> are unnecessary. Or, knowing that his/her code will be reused (in an unknown
> context, by someone who is not so skilled, and will probably not think to
> redo the proof) should such checks not be optimized out? What rule of thumb
> should be used to decide (e.g., what if the proof assumes the rocket speed
> will not exceed that of light)? Since software operates in the real world (not
> the world of mathematics) should mathematical proofs about code always yield
> to engineering rules of thumb to expect the unexpected.

> What is the rule of thumb about when should mathematics be believed?
>

Firstly, I wish there were more CS teachers like you. These are
excellent engineering questions.

Secondly, answers:
I tend towards the philosophy of "Leave every check in". In 12+ years
of Ada programming, I've never seen Pragma Suppress All Checks make
the difference between success and failure. At best it gives a 5%
improvement. This means that, in order to debug the code quickly,
it's useful to have such checks even when not strictly necessary.

For re-use, you then often have the Ariane problem. That is, the
unnecessary checks you included coming around and biting you, as the
assumptions you were making in the previous project become invalid.

So.... You make sure the assumptions/consequences get put into a
separate package. A system-specific package, one that will be changed
when re-used. Which means that if the subsystem gets re-used a lot,
the system-specific stuff will eventually be re-written so as to allow
for re-use easily.
Example: a car's cruise control: MAX_SPEED : constant := 200.0*MPH;
Gets re-used in an airliner - change to 700.0*MPH. Then onto an SST -
2000.0*MPH. Eventually, you make it 2.98E8*MetresPerSec. Then some
Bunt invents a Warp Drive, and you're wrong again.

Summary: Label the constraints and assumptions, stick them as comments
in the code and design notes, put them in a separate package...and some
dill will still stuff up, but that's the best you can do. And in the
meantime, you allow the possibility of finding a number of errors
early.
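
A minimal sketch of such a system-specific package -- hypothetical
names and values:

   package System_Assumptions is
      -- Every constant here encodes an assumption about THIS platform;
      -- this is the package you revisit (and re-justify) on re-use.
      MPH       : constant := 0.447;        -- metres per second
      Max_Speed : constant := 200.0 * MPH;  -- cruise-control platform
      Max_Accel : constant := 1.5 * 9.81;   -- metres per second squared
   end System_Assumptions;

A new vehicle then means editing one small, clearly-labelled package
instead of hunting magic numbers through the guidance code.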

Robert A Duff

Sep 29, 1996

In article <324F11...@dynamite.com.au>,

Alan Brain <aeb...@dynamite.com.au> wrote:
>Brain's law:
>"Software Bugs and Hardware Faults are no excuse for the Program not to
>work".
>
>So: it costs peanuts, and may save your hide.

This reasoning doesn't sound right to me. The hardware part, I mean.
The reason checks-on costs only 5% or so is that compilers aggressively
optimize out almost all of the checks. When the compiler proves that a
check can't fail, it assumes that the hardware is perfect. So, hardware
faults and cosmic rays and so forth are just as likely to destroy the
RTS, or cause the program to take a wild jump, or destroy the call
stack, or whatever -- as opposed to getting a Constraint_Error and
recovering gracefully. After all, the compiler doesn't range-check the
return address just before doing a return instruction!

- Bob

Chris McKnight

Sep 29, 1996

In article H...@beaver.cs.washington.edu, pat...@cs.washington.edu (Richard Pattis) writes:
>As an instructor in CS1/CS2, this discussion interests me. I try to talk about
>designing robust, reusable code, and actually have students reuse code that
>I have written as well as some that they (and their peers) have written.
>The Ariane failure adds a new view to robustness, having to do with future
>use of code, and mathematical proof vs "engineering" considerations..

An excellent bit of teaching, IMHO. Glad to hear they're putting some
more of the real world issues in the class room.

>Should a software engineer remove safety checks if he/she can prove - based on
>physical limitations, like a rocket not exceeding a certain speed - that they
>are unnecessary. Or, knowing that his/her code will be reused (in an unknown
>context, by someone who is not so skilled, and will probably not think to
>redo the proof) should such checks not be optimized out? What rule of thumb
>should be used to decide (e.g., what if the proof assumes the rocket speed
>will not exceed that of light)? Since software operates in the real world (not
>the world of mathematics) should mathematical proofs about code always yield
>to engineering rules of thumb to expect the unexpected.

A good question. For the most part, I'd go with engineering rules of thumb
(what did you expect, I'm an engineer). As an engineer, you never know what
may happen in the real world (in spite of what you may think), so I prefer
error detection and predictable recovery. The key factors to consider include
the likelihood and the cost of failures, and the cost of leaving in (or adding
where your language doesn't already provide it) the checks.

Consider these factors, likelihood and cost of failures:

In a real-time embedded system, both of these factors are often high. Of
the two, I think people most often get caught out by misbeliefs about
the likelihood of failure. As an example, I've argued more than once
with engineers who think
that since a device is only "able" to give them a value in a certain range,
they needn't check for out of range values. I've seen enough failed hardware
to know that anything is possible, regardless of what the manufacturer may
claim. Consider your speed-of-light example: what if the sensor goes bonkers
and tells you that you're going faster? Your "proof" that you can't get that
value falls apart then. Your point about reuse is also well made. Who knows
what someone else may want to use your code for?

As for cost of failure, it's usually obvious; in dollars, in lives, or both.

As for cost of leaving checks in (or putting them in):

IMHO, the cost is almost always insignificant. If the timing is so tight that
removing checks makes the difference, it's probably time to redesign anyway.
After all, in the real world there are always going to be fixes, new features,
etc.. that need to be added later, so you'd better plan for it. Also, it's
been my experience that removing checks is somewhere in the single digits
on % improvement. If you're really that tight, a good optimizer can yield
10%-15% or more (actual mileage may vary of course). But again, if that
makes the difference, you'd better rethink your design.

So the rule of thumb I use is, unless a device is not physically capable (as
opposed to theoretically capable) of giving me out of range data, I'm going
to range-check it. I.e., if there's 3 bits, you'd better check for 8 values
regardless of the number of values you think you can get.
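
For example, a sketch of that rule for a 3-bit mode field -- made-up
names:

   procedure Decode_Mode is
      type Mode_Bits is range 0 .. 7;   -- every value 3 bits can carry
      type Mode is (Idle, Run, Test, Fault);
      Raw    : Mode_Bits := 5;          -- stand-in for a register read
      Actual : Mode;
   begin
      case Raw is
         when 0      => Actual := Idle;
         when 1      => Actual := Run;
         when 2      => Actual := Test;
         when others =>
            -- "can't happen" per the data sheet -- until a pin floats
            -- or a bit flips; report it and pick a safe mode
            Actual := Fault;
      end case;
   end Decode_Mode;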

That having been said, it's often not up to the engineer to make these
decisions. Such things as political considerations, customer demands, and
(more often than not) management decisions have been known to succeed in
convincing me to turn checks off. As a rule, however, I fight to keep them
in, at very least through development and integration.

> As to saving SPEED by disabling the range checks: did the code not meet its
>speed requirements with range checks on? Only in this case would I have turned
>them off. Does "real time" mean fast enough or as fast as possible? To
>misquote Einstein, "Code should run as fast as necessary, but no faster...."
>since something is always traded away to increase speed.

Precisely! And when what's being traded is safety, it's not worth it.


Cheers,

Chris


=========================================================================

"I was gratified to be able to answer promptly. I said I don't know".
-- Mark Twain

=========================================================================


Michael Feldman

Sep 29, 1996


In article <1996Sep29.1...@enterprise.rdd.lmsc.lockheed.com>,
Chris McKnight <cmck...@hercii.lasc.lockheed.com> wrote:

[Rich Pattis' good stuff snipped.]


>
> An excellent bit of teaching, IMHO. Glad to hear they're putting some
> more of the real world issues in the class room.

Rich Pattis is indeed an experienced, even gifted teacher of
introductory courses, with a very practical view of what they
should be about.

Without diminishing Rich Pattis' teaching experience or skill one bit,
I am somewhat perplexed at the unfortunate stereotypical view you
seem to have of CS profs. Yours is the second post today to have
shown evidence of that stereotypical view; both you and the other
poster have industry addresses.

This is my 22nd year as a CS prof, I travel a lot in CS education
circles, and - while we, like any population, tend to hit a bell
curve - I've found that there are a lot more of us out here than
you may think with Pattis-like commitment to bring the real world
into our teaching.

Sure, there are theorists, as there are in any field, studying
and teaching computing just because it's "beautiful", with little
reference to real application, and there's a definite place in the
teaching world for them. Indeed, exposure to their "purity" of
approach is healthy for undergraduates - there is no harm at all
in taking on computing - sometimes - as purely an intellectual
exercise.

But it's a real reach from there to an assumption that most of us
are in that theoretical category.

I must say that there's a definite connection between an interest
in Ada and an interest in real-world software; certainly most of
the Ada teachers I've met are more like Pattis than you must think.
Indeed, it's probably our commitment to that "engineering" view
of computing that brings us to like and teach Ada.

But it's not just limited to Ada folks. I had the pleasure of
participating in a SIGCSE panel last March entitled "the first
year beyond language." Organized by Owen Astrachan of Duke,
a C++ fan, this panel consisted of 6 teachers of first-year
courses, each using a different language. Pascal, C++, Ada,
Scheme, Eiffel, and (as I recall) ML were represented.

The challenge Owen made to each of us was to give a 10-minute
"vision statement" for first-year courses, without identifying
which language we "represented." Owen revealed the languages to
the audience only after the presentations were done.

It was _really_ gratifying that - with no prior agreement or
discussion among us - five of the six of us presented very similar
visions, in the "computing as engineering" category. It doesn't
matter which language the 6th used; the important thing was that,
considering the diversity of our backgrounds, teaching everywhere
from small private colleges to big public universities, we were
in _amazing_ agreement.

The message for me in the stereotype presented above is that it's
probably out of date and certainly out of touch. I urge my
industry friends to get out of _their_ ivory towers, and come
visit us. Find out what we're _really_ doing. I think you'll
be pleasantly surprised.

Especially, check out those of us who are introducing students
to _Ada_ as their first, foundation language.

Mike Feldman

------------------------------------------------------------------------
Michael B. Feldman - chair, SIGAda Education Working Group
Professor, Dept. of Electrical Engineering and Computer Science
The George Washington University - Washington, DC 20052 USA
202-994-5919 (voice) - 202-994-0227 (fax)
http://www.seas.gwu.edu/faculty/mfeldman
------------------------------------------------------------------------
Pork is all that money the government gives the other guys.
------------------------------------------------------------------------
WWW: http://lglwww.epfl.ch/Ada/ or http://info.acm.org/sigada/education
------------------------------------------------------------------------

Wayne L. Beavers

Sep 30, 1996

I have been reading this thread awhile and one topic that I have not seen mentioned is protecting the code
area from damage. When I code in PL/I or any other reentrant language I always make sure that the executable
code is executing from read-only storage. There is no way to put the data areas in read-only storage
(obviously) but I can't think of any reason to put the executable code in writeable storage.

I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable code, from one system to another. The
single most common error I had to correct was incorrect usage of pointer variables. I caught a lot of them
whenever they attempted to accidentally store into the code area. At that point it is trivial to correct the
bug. This technique certainly doesn't catch all pointer failures, but it will catch at least some of them.

Michael Dworetsky

Oct 1, 1996

In article <84384503...@assen.demon.co.uk> jo...@assen.demon.co.uk (John McCabe) writes:
>r...@goanna.cs.rmit.edu.au (@@ robin) wrote:
>
><..snip..>
>
>Just a point for your information. From clari.tw.space:
>
> "An inquiry board investigating the explosion concluded in
>July that the failure was caused by software design errors in a
>guidance system."
>
>Note software DESIGN errors - not programming errors.
>

Indeed, the problems were in the specifications given to the programmers,
not in the coding activity itself. They wrote exactly what they were
asked to write, as far as I could see from reading the report summary.

The problem was caused by using software developed for Ariane 4's flight
characteristics, which were different from those of Ariane 5. When the
launch vehicle exceeded the boundary parameters of the Ariane-4 software,
it sent an error message and, as specified by the remit given to
programmers, a critical guidance system shut down in mid-flight. Ka-boom.


--
Mike Dworetsky, Department of Physics | Haiku: Nine men ogle gnats
& Astronomy, University College London | all lit
Gower Street, London WC1E 6BT UK | till last angel gone.
email: m...@star.ucl.ac.uk | Men in Ukiah.


Marin David Condic, 407.796.8997, M/S 731-93

Oct 1, 1996

Matthew Heaney <mhe...@NI.NET> writes:
>
>Why, yes. If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is. The rocket's gone,
>so what difference does it make how fast the code executed? If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?
>
It's not a case of saving a few CPU cycles so you can run Space
Invaders in the background. Quite often (and in particular in
*space* systems which are limited to rather antiquated
processors) the decision is to a) remove the runtime checks from
the compiled image and run with the possible risk of undetected
constraint errors, etc. or b) give up and go home because there's
no way you are going to squeeze the necessary logic into the box
you've got with all the checks turned on.

It's not as if we take these decisions lightly and are just being
stingy with CPU cycles so we can save them up for our old age. We
remove the checks typically because there's no other choice.

MDC

Marin David Condic, Senior Computer Engineer ATT: 561.796.8997
M/S 731-96 Technet: 796.8997
Pratt & Whitney, GESP Fax: 561.796.4669
P.O. Box 109600 Internet: COND...@PWFL.COM
West Palm Beach, FL 33410-9600 Internet: CON...@FLINET.COM
===============================================================================
"Some people say a front-engine car handles best. Some people say
a rear-engine car handles best. I say a rented car handles best."

-- P. J. O'Rourke
===============================================================================

Ken Garlington

Oct 1, 1996

Wayne L. Beavers wrote:
>
> I have been reading this thread awhile and one topic that I have not seen mentioned is protecting the code
> area from damage. When I code in PL/I or any other reentrant language I always make sure that the executable
> code is executing from read-only storage. There is no way to put the data areas in read-only storage
> (obviously) but I can't think of any reason to put the executable code in writeable storage.

That's actually a pretty common rule of thumb for safety-critical systems.
Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
can cause a random change in the memory. So, it's not a perfect fix.

>
> I one had to port 8,000 subroutines in PL/I, 24 megabytes of executable code from one system to another. The
> single most common error I had to correct was incorrect usage of pointer variables. I caught a lot of them
> when ever they attempted to accidently store into the code area. At that point it is trivial to correct the
> bug. This technique certainly doesn't catch all pointer failures, but it will catch at least some of them.

--

Ken Garlington

Oct 1, 1996

Richard Pattis wrote:
>
[snip]

> If I were to try to create a lecture on this topic, what other similar
> failures should I know about (beside the legendary Venus probe)?
> Your comments?

"Safeware" by Leveson has some additional good examples about what can
go wrong with software. The RISKS conference also has a lot of info on
this.

There was a study done several years ago by a Dr. Avezzianis (I always screw
up that spelling, and I'm always too lazy to go look it up...) trying to
show the worth of N-version programming. He had five teams of students write
code for part of a flight control system. Each team was given the same set
of control law diagrams (which are pretty detailed, as requirements go), and
each team used the same sort of meticulous software engineering approach that
you would expect for a safety-critical system (no formal methods, however).
Each team's software was almost error-free, based on tests done using the
same test data as the actual delivered flight controls.

Note I said "almost". Every team made one mistake. Worse, it was the _same_
mistake. The control law diagrams were copies. The copier apparently wasn't
a good one, because a comma in one of the gains ended up looking like a
decimal point (or maybe it was the other way around -- I forget). Anyway,
the gain was accidentally coded as 2.345 vs 2,345, or something like that.
That kind of error makes a big difference!

In the face of that kind of error, I've never felt that formal methods had a
chance. That's not to say that formal methods can't detect a lot of different
kinds of failures, but at some level some engineer has to be able to say: "That
doesn't make sense..."

If you want to try to find this study, I believe it was reported at a Digital
Avionics Systems Conference many years ago (in San Jose?), probably around 1986.

>
> Rich

Ken Garlington

Oct 1, 1996

Alan Brain wrote:
>
> 1. Suppressing all checks in Ada-83 makes about a 5% difference in
> execution speed, in typical real-time and avionics systems. (For
> example, B2 simulator, CSU-90 sonar, COSYS-200 Combat system). If your
> hardware budget is this tight,
> you'd better not have lives at risk, or a lot of money, as technical
> risk is
> appallingly high.

Actually, I've seen systems where checks make much more than a 5% difference.
For example, in a flight control system, checks done in the redundancy
management monitor (comparing many redundant inputs in a tight loop) can
easily add 10% or more.

I have also seen flight-critical systems where 5% is a big deal, and where you
can _not_ add a more powerful processor to fix the problem. Flight control
software usually exists in a flight control _system_, with system issues of
power, cooling, space, etc. to consider. On a missile, these are important
issues. You might consider the technical risk "appallingly high," but the fix
for that risk can introduce equally dangerous risks in other areas.

> 2. If you know the range is 0-100, and you get 101, what does this show?
> a) A bug in the code (99.9999....% probable). b) A hardware fault. c) A
> soft failure, as in a stray cosmic ray zapping a bit. d) a faulty
> analysis of your "can't happen" situation. As in re-use, or where your
> array comes from an IO channel with noise on....

You forgot (e) - a failure in the inputs. The range may be calculated,
directly or indirectly, from an input to the system. In practice, at least
for the systems I'm familiar with, that's usually where the error came
from -- either a connector fell off, or some wiring shorted out, or a bird
strike took out half of your sensors. I definitely would say that, when we
have a failure reported in operation, it's not usually because of a bug in
the software for our systems!

> Type a) and d) failures should be caught during testing. Most of them.
> OK, some of them. Range checking here is a necessary debugging aid. But
> type b) and c) can happen too out in the real world, and if you don't
> test for an error early, you often can't recover the situation. Lives or
> $ lost.
>

> Brain's law:
> "Software Bugs and Hardware Faults are no excuse for the Program not to
> work".

Too bad that law can't be enforced :)

Wayne L. Beavers

Oct 1, 1996

Ken Garlington wrote:

> That's actually a pretty common rule of thumb for safety-critical systems.
> Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> can cause a random change in the memory. So, it's not a perfect fix.

You're right, but the risk and probability of memory failures is pretty low, I would think. I have never seen
or heard of a memory failure in any of the systems that I have worked on. I don't know what the current
technology is, but I can remember quite a while ago that at least one vendor was claiming that ALL double bit
memory errors were fully detectable and recoverable, ALL triple bit errors were detectable but only some were
correctable. But I also don't work on realtime systems, my experience is with commercial systems.

Are you referring to on-board systems for aircraft, where weight and vibration are also a factor, or are you
referring to ground-based systems that don't have similar constraints?

Does anyone know just how good memory ECC is these days?

Wayne L. Beavers way...@beyond-software.com
Beyond Software, Inc.
The Mainframe/Internet Company
http://www.beyond-software.com/

Marin David Condic, 407.796.8997, M/S 731-93

Oct 1, 1996

Ken Garlington <garlin...@LMTAS.LMCO.COM> writes:

>Alan Brain wrote:
>> A really good safety-critical
>> program should be remarkably difficult to de-bug, as the only way you
>> know it's got a major problem is by examining the error log, and
>> calculating that it's performance is below theoretical expectations.
>> And if it runs too slow, many times in the real-world you can spend 2
>> years of development time and many megabucks kludging the software, or
>> wait 12 months and get the new 400 Mhz chip instead of your current 133.
>
>I really need to change jobs. It sounds so much simpler to build
>software for ground-based PCs, where you don't have to worry about the
>weight, power requirements, heat dissipation, physical size,
>vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.
>
I personally like the part about "performance is below theoretical
expectations". Where I live, I have a 5 millisecond loop which
*must* finish in 5 milliseconds. If it runs in 7 milliseconds, we
will fail to close the loop in sufficient time to keep valves from
"slamming into stops", causing them to break, rendering someone's
billion dollar rocket and billion dollar payload "unserviceable".
In this business, that's what *we* mean by "performance is below
theoretical expectations" and why runtime checks which seem
"trivial" to most folks can mean the difference between having a
working system and having an interesting exercise in computer
science which isn't going to go anywhere.

Marin David Condic, 407.796.8997, M/S 731-93

Oct 1, 1996

Robert A Duff <bob...@WORLD.STD.COM> writes:

>Alan Brain <aeb...@dynamite.com.au> wrote:
>>Brain's law:
>>"Software Bugs and Hardware Faults are no excuse for the Program not to
>>work".
>>
>>So: it costs peanuts, and may save your hide.
>
>This reasoning doesn't sound right to me. The hardware part, I mean.
>The reason checks-on costs only 5% or so is that compilers aggressively
>optimize out almost all of the checks. When the compiler proves that a
>check can't fail, it assumes that the hardware is perfect. So, hardware
>faults and cosmics rays and so forth are just as likely to destroy the
>RTS, or cause the program to take a wild jump, or destroy the call
>stack, or whatever -- as opposed to getting a Constraint_Error a
>reocovering gracefully. After all, the compiler doesn't range-check the
>return address just before doing a return instruction!
>
Typically, this is why you build dual-redundant systems. If a
cosmic ray flips some bits in one processor causing bad data which
does/does not get range-checked, then computer "A" goes crazy and
computer "B" takes control. Hopefully they don't *both* get hit by
cosmic rays at the same time.

The real danger is a common mode failure where a design flaw
exists in the software used by both channels - they both see the
same inputs and both make the same mistake. Of course trapping
those exceptions doesn't necessarily guarantee success since
either the exception handler or the desired accommodation could
also be flawed and the flaw will, by definition, exist in both
channels.

If all you're protecting against is software design failures (not
hardware failures) then obviously being able to analyze code and
prove that a particular case can never happen should be sufficient
to permit the removal of runtime checks.

Ken Garlington

Oct 1, 1996

Wayne L. Beavers wrote:
>
> Ken Garlington wrote:
>
> > That's actually a pretty common rule of thumb for safety-critical systems.
> > Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> > can cause a random change in the memory. So, it's not a perfect fix.
>
> You're right, but the risk and probability of memory failures is pretty low I would think. I have never seen
> or heard of a memory failure in any of the systems that I have worked on. I don't know what the current
> technology is but I can remember quite awhile ago that at least one vendor was claiming that ALL double bit
> memory errors were fully detectable and recoverable, ALL triple bit errors were detectable but only some were
> correctable. But I also don't work on realtime systems, my experience is with commercial systems.
>
> Are you refering to on-board systems for aircraft where weight and vibration are also a factor or are you
> refering to ground base systems that don't have similar constraints?

On-board systems. The failure _rate_ is usually pretty low, but in a harsh environment
you can get quite a few failure _sources_, including mechanical failures (stress
fractures, solder loss due to excessive heat, etc.), electrical failures (EMI,
lightning), and so forth. You don't have to take out the actual chip, of course: just
as bad is a failure in the address or data lines connecting the memory to the CPU. Add
a memory management unit to the mix, along with various I/O devices mapped into the
memory space, and you can get a whole slew of memory-related failure modes.

You can also get into some neat system failures. For example, some "read-only" memory
actually allows writes to the execution space in certain modes, to allow quick
reprogramming. If you have a system failure that allows writes at the wrong time,
coupled with a failure that does a write where it shouldn't...

Alan Brain

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to

Marin David Condic, 407.796.8997, M/S 731-93 wrote:
>
> Ken Garlington <garlin...@LMTAS.LMCO.COM> writes:

> >I really need to change jobs. It sounds so much simpler to build
> >software for ground-based PCs, where you don't have to worry about the
> >weight, power requirements, heat dissipation, physical size,
> >vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.
> >

The particular system I was talking about was for a Submarine. Very tight
constraints indeed, on power (it was a diesel sub), physical size (had to
fit in a torpedo hatch), heat dissipation (a bit), vulnerability to 100%
humidity, salt, chlorine etc etc. Been there, Done that, Got the T-shirt.

I'm a Software Engineer who works mainly in Systems. Or maybe a Systems
Engineer with a hardware bias. Regardless, in the initial Systems Engineering
phase, when one gets all the HWCIs and CSCIs defined, it is only good
professional practice to build in plenty of slack. If the requirement is to
fit in a 21" hatch, you DON'T design something that's 20.99999" wide. If you
can, make it 16", 18 at max. It'll probably grow. Similarly, if you require a
minimum of 25 MFlops, make sure there's a growth path to at least 100. It may
well be less expensive and less risky to build a chip factory to make a
faster CPU than to lose a rocket, or a sub, to a software failure that could
have been prevented. Usually such ridiculously extreme measures are not
necessary. The Hardware guys bitch about the cost-per-CPU going through the
roof. Heck, it could cost $10 million. But if it saves 2 years of Software
effort, that's a net saving of $90 million. (All numbers are representative,
i.e. plucked out of mid-air, and as you USAians say, Your Mileage May Vary.)



> I personally like the part about "performance is below theoretical
> expectations". Where I live, I have a 5 millisecond loop which
> *must* finish in 5 milliseconds. If it runs in 7 milliseconds, we
> will fail to close the loop in sufficient time to keep valves from
> "slamming into stops", causing them to break, rendering someone's
> billion dollar rocket and billion dollar payload "unserviceable".
> In this business, that's what *we* mean by "performance is below
> theoretical expectations" and why runtime checks which seem
> "trivial" to most folks can mean the difference between having a
> working system and having an interesting exercise in computer
> science which isn't going to go anywhere.

In this case, "theoretical expectations" for a really tight 5 MuSec loop
should be less than 1 MuSec. Yes, I'm dreaming. OK, 3 MuSec, that's my
final offer. For the vast majority of cases, if your engineering is
closer to
the edge than that, it'll cost big bucks to fix the over-runs you always
get.

Typical example: I had a big bun-fight with project management about a
hefty data transfer rate required for a broadband sonar. They wanted to
hand-code the lot in assembler, as the requirements were really, really
tight. No time for any of this range-check crap, the data was always good.
I eventually threw enough of a professional tantrum to wear down even a
group of German Herr Professor Doktors, and we did it in Ada-83. If only
as a first pass, to see what the rate really would be.
The spec called for 160 MB/Sec. First attempt was 192 MB/Sec, and after
some optimisation, we got over 250. After the hardware flaws were fixed
(the ones the "unnecessary" range-bound checking detected) this was above
300. Now that's too close for my druthers, but even 161 I could live with.
Saved maybe 16 months on the project, about 100 people at $15K a month.
After the transfer, the data really was trustworthy - which saved a lot
of time downstream on the applications in debug time.
Note that even with (minor) hardware flaws, the system still worked. Note
also that by paying big $ for more capable hardware than strictly
necessary, you can save bigger $ on the project.
Many projects spend many months and many $ Million to fix, by hacking,
kludging, and sheer Genius what a few lousy $100K of extra hardware cost
would make unnecessary. A good software engineer in the Risk-management
team, and on the Systems Engineering early on, one with enough technical
nous in hardware to know what's feasible, enough courage to cost the firm
millions in initial costs, and enough power to make it stick, that's
what's necessary. I've seen it; it works.

But it's been tried less than a dozen times in 15 years in my experience :(

Sandy McPherson

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to

It depends upon what you mean by a memory failure. I can imagine that
the chances of your memory being trashed completely are very very low, and
in rad-hardened systems the chance of a single-event-upset (SEU) is
also low, but has to be guarded against. I have recently been working on
a system where the specified hardware has a parity bit for each octet of
memory, so SEUs which flip bit values in the memory can be detected.
This parity check is built into the system's micro-code.
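
Purely for illustration (in the real box the check lives in micro-code,
and all of these names are invented), the per-octet parity idea amounts
to something like this:

with Interfaces;  use Interfaces;
with Ada.Text_IO; use Ada.Text_IO;

procedure Parity_Demo is

   --  Even parity over one octet: True if the number of 1 bits is even.
   function Even_Parity (Octet : Unsigned_8) return Boolean is
      Ones : Natural    := 0;
      B    : Unsigned_8 := Octet;
   begin
      while B /= 0 loop
         Ones := Ones + Natural (B and 1);
         B    := Shift_Right (B, 1);
      end loop;
      return Ones mod 2 = 0;
   end Even_Parity;

   Stored    : constant Unsigned_8 := 2#1011_0010#;
   Parity    : constant Boolean    := Even_Parity (Stored);
   Read_Back : Unsigned_8          := Stored;

begin
   Read_Back := Read_Back xor 2#0001_0000#;   --  simulate a single-event upset

   if Even_Parity (Read_Back) /= Parity then
      Put_Line ("parity mismatch: single-bit upset detected in this octet");
   end if;
end Parity_Demo;

A single flipped bit is detected (though not corrected, and a double
flip in the same octet would slip through), which is why it is only a
detection mechanism and the recovery still has to come from somewhere else.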

Similarly, the definition of what is and isn't read-only memory is
usually a feature of the processor and/or operating system being used. A
compiler cannot put code into read-only areas of memory unless the
processor, its micro-code and/or o/s are playing ball as well. If you are
unfortunate enough to be in this situation (are there any such systems
left?), then the only thing you can do is DIY, but the compiler can't
help you much, other than the for-use-at clause.

I once read an interesting definition of two types of bugs in
"transaction processing" by Gray & Reuter, Heisenbugs and Bohrbugs.

Identification of potential Heisenbugs, estimation of probability of
occurrence, impact to the system on occurrence, and appropriate recovery
procedures are part of the risk analysis. An SEU is a classic Heisenbug,
which IMO is out of scope of compiler checks, because they can result in
a valid but incorrect value for a variable and are just as likely to
occur in the code section as the data section of your application. A
complete memory failure is of course beyond the scope of the compiler.

IMO an Ada compiler's job (when used properly) is to make sure that
syntactic Bohrbugs do not enter a system and all semantic Bohrbugs get
detected at runtime (as Bohrbugs, by definition, have a fixed location
and are certain to occur under given conditions -- the Ariane 5 bug was
definitely a Bohrbug). The compiler cannot do anything about Heisenbugs
(because they only have a probability of occurrence). To handle
Heisenbugs generally you need to have a detection, reporting and
handling mechanism: built using the hardware's error detection, generally
accepted software practices (e.g. duplicate storage, process-pairs) and
an application-dependent exception handling mechanism. Ada provides the
means to trap the error condition once it has been reported, but it does
not implement exception handlers for you, other than the default "I'm
gone..."; additionally, if the underlying system does not provide the
means to detect a probable error, you have to implement the means of
detecting the problem and reporting it through the Ada exception
handling yourself.
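
As a toy sketch of the "duplicate storage" practice mentioned above,
with the detection and reporting done in application code and the
recovery left to an ordinary Ada exception handler (the names and the
recovery policy are all invented for the example):

with Interfaces;  use Interfaces;
with Ada.Text_IO; use Ada.Text_IO;

procedure Duplicate_Storage_Demo is

   Memory_Corruption : exception;

   type Guarded is record
      Value  : Unsigned_16;
      Shadow : Unsigned_16;   --  always stored as the complement of Value
   end record;

   procedure Store (G : out Guarded; V : Unsigned_16) is
   begin
      G.Value  := V;
      G.Shadow := not V;
   end Store;

   --  Detect and report: if the two copies no longer agree, something
   --  (an SEU, a wild store) has changed one of them behind our back.
   function Fetch (G : Guarded) return Unsigned_16 is
   begin
      if G.Shadow /= not G.Value then
         raise Memory_Corruption;
      end if;
      return G.Value;
   end Fetch;

   Altitude : Guarded;

begin
   Store (Altitude, 1_000);
   Put_Line (Unsigned_16'Image (Fetch (Altitude)));

   Altitude.Value := Altitude.Value xor 2#0000_0100#;   --  simulate an upset

   begin
      Put_Line (Unsigned_16'Image (Fetch (Altitude)));
   exception
      when Memory_Corruption =>
         Put_Line ("duplicate-storage check caught the upset");
         --  the application-dependent part: re-read, re-initialise,
         --  switch channels, whatever the risk analysis called for
   end;
end Duplicate_Storage_Demo;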


--
Sandy McPherson MBCS CEng. tel: +31 71 565 4288 (w)
ESTEC/WAS
P.O. Box 299
NL-2200AG Noordwijk

Simon Johnston

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to

Michael Feldman wrote:
> In article <1996Sep29.1...@enterprise.rdd.lmsc.lockheed.com>,
> Chris McKnight <cmck...@hercii.lasc.lockheed.com> wrote:
>

> [Rich Pattis' good stuff snipped.]
> >
> > An excellent bit of teaching, IMHO. Glad to hear they're putting some
> > more of the real world issues in the class room.
>

> Rich Pattis is indeed an experienced, even gifted teacher of
> introductory courses, with a very practical view of what they
> should be about.
>

> Without diminishing Rich Pattis' teaching experience or skill one bit,
> I am somewhat perplexed at the unfortunate stereotypical view you
> seem to have of CS profs. Yours is the second post today to have
> shown evidence of that stereotypical view; both you and the other
> poster have industry addresses.

I think some of it must come from experience. I have met some really
good, industry-focused profs who teach with a real "useful" view (my
first serious language was COBOL!). I have also met the "computer
science" guys, without whom we would never move forward. I have also met
some in between who really don't have that engineering focus or the
science.


> This is my 22nd year as a CS prof, I travel a lot in CS education
> circles, and - while we, like any population, tend to hit a bell
> curve - I've found that there are a lot more of us out here than
> you may think with Pattis-like commitment to bring the real world
> into our teaching.

Mike, I know from your books and postings here the level of engineering
you bring to your teaching; we are discussing (I believe) the balance in
teaching computing as an engineering discipline or as an ad-hoc
individual "art".

> Sure, there are theorists, as there are in any field, studying
> and teaching computing just because it's "beautiful", with little
> reference to real application, and there's a definite place in the
> teaching world for them. Indeed, exposure to their "purity" of
> approach is healthy for undergraduates - there is no harm at all
> in taking on computing - sometimes - as purely an intellectual
> exercise.

>


> But it's a real reach from there to an assumption that most of us
> are in that theoretical category.

I don't think many of the people I work with have made this leap.


> I must say that there's a definite connection between an interest
> in Ada and an interest in real-world software; certainly most of
> the Ada teachers I've met are more like Pattis than you must think.
> Indeed, it's probably our commitment to that "engineering" view
> of computing that brings us to like and teach Ada.

Certainly (or as in my case COBOL) it leads you into an
application-oriented way of thinking which makes you think about
requirements, testing etc.

[snip]

Let me give you a little anecdote of my own.
I recently went for a job interview with a very large well-known
software firm. Firstly they wanted me to write the code to traverse a
binary tree for which they described the (C) data structures. Then I was
asked to write code to insert a node in a linked list (I had to ask what
the requirements for cases such as the list being empty or the node
already existing were). Finally I was asked to write the code to find
all the anagrams in a given string.
There were no business-type questions, no true analytical questions, the
things which as an engineer I have to do each day. The problems set me
have a simple and single answer which I don't write each day. I am sure
you can recite off hand the way to traverse a binary tree, but I have to
stop and think because I wrote it ONCE, AGES AGO and wrote it as a
GENERIC which I can REUSE. I know an understanding of these algorithms
is required so that I can decide which of my generics to use, but that
is why I invest in good books!
By the way, I happen to know someone who works for this firm who told me
that graduate programmers seem to do well in their interview process; he
once interviewed an engineer with 20 years industry experience and a PhD
who got up and left half way through the interview in disgust.

with StandardDisclaimer; use StandardDisclaimer;
package Sig is
--,-------------------------------------------------------------------------.
--|Simon K. Johnston - Development Engineer (C++/Ada95) |ICL Retail Systems |
--|-----------------------------------------------------|3/4 Willoughby Road|
--|Internet : s...@acm.org                              |Bracknell          |
--|Telephone: +44 (0)1344 476320 Fax: +44 (0)1344 476302|Berkshire          |
--|Internal : 7261 6320 OP Mail: S.K.Johnston@BRA0801   |RG12 8TJ           |
--|WWW URL : http://www.acm.org/~skj/                   |United Kingdom     |
--`-------------------------------------------------------------------------'
end Sig;

Robert I. Eachus

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to

In article <9610011...@psavax.pwfl.com> "Marin David Condic, 407.796.8997, M/S 731-93" <cond...@PWFL.COM> writes:

Marin David Condic

> It's not a case of saving a few CPU cycles so you can run Space
> Invaders in the background. Quite often (and in particular in
> *space* systems which are limited to rather antiquated
> processors) the decision is to a) remove the runtime checks from
> the compiled image and run with the possible risk of undetected
> constraint errors, etc. or b) give up and go home because there's
> no way you are going to squeeze the necessary logic into the box
> you've got with all the checks turned on.

> It's not as if we take these decisions lightly and are just being
> stingy with CPU cycles so we can save them up for our old age. We
> remove the checks typically because there's no other choice.

In this case though, management threw out the baby with the
bathwater. To preserve a 20% margin in the presence of a kludge
already known to be applicable only to the Ariane 4, they took out
checks that would be vital if the kludge ran on the Ariane 5, then
forgot to take the kludge out.

The proper solution was to recognize in the performance specs that
the load was 81% or whatever until the inertial alignment software
shut down after launch.


--

Robert I. Eachus

with Standard_Disclaimer;
use Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...

Ken Garlington

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to

Marin David Condic, 407.796.8997, M/S 731-93 wrote:
>
> The real danger is a common mode failure where a design flaw
> exists in the software used by both channels - they both see the
> same inputs and both make the same mistake. Of course trapping
> those exceptions doesn't necessarily guarantee success since
> either the exception handler or the desired accommodation could
> also be flawed and the flaw will, by definition, exist in both
> channels.

The problem also exists if you have a common-mode _hardware_ failure (e.g.
a hardware design fault, or an external upset like lightning that hits
both together).

--
LMTAS - "Our Brand Means Quality"

For more info, see http://www.lmtas.com or http://www.lmco.com

Ken Garlington

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to

Robert I. Eachus wrote:
>
> In article <9610011...@psavax.pwfl.com> "Marin David Condic, 407.796.8997, M/S 731-93" <cond...@PWFL.COM> writes:
>
> Marin David Condic
>
> > It's not a case of saving a few CPU cycles so you can run Space
> > Invaders in the background. Quite often (and in particular in
> > *space* systems which are limited to rather antiquated
> > processors) the decision is to a) remove the runtime checks from
> > the compiled image and run with the possible risk of undetected
> > constraint errors, etc. or b) give up and go home because there's
> > no way you are going to squeeze the necessary logic into the box
> > you've got with all the checks turned on.
>
> > It's not as if we take these decisions lightly and are just being
> > stingy with CPU cycles so we can save them up for our old age. We
> > remove the checks typically because there's no other choice.
>
> In this case though, management threw out the baby with the
> bathwater. To preserve a 20% margin in the presence of a kludge
> already known to be applicable only to the Ariane 4, they took out
> checks that would be vital if the kludge ran on the Ariane 5, then
> forgot to take the kludge out.

The critical part of this correct statement, of course, being "In this
case..". In another context, this might have been the right decision.

It's also important to remember that Ariane 5 didn't exist when the
Ariane 4 team made this decision. They may have been short-sighted, but
they weren't idiots based on what they knew at the time.

The Ariane 5 management not doing sufficient re-analysis and re-test of
this "off-the-shelf" system is, to me, much less excusable.

Ken Garlington

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to aeb...@dynamite.com.au

Alan Brain wrote:
>
> Marin David Condic, 407.796.8997, M/S 731-93 wrote:
> >
> > Ken Garlington <garlin...@LMTAS.LMCO.COM> writes:
>
> > >I really need to change jobs. It sounds so much simpler to build
> > >software for ground-based PCs, where you don't have to worry about the
> > >weight, power requirements, heat dissipation, physical size,
> > >vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.
> > >
>
> The particular system I was talking about was for a Submarine. Very tight
> constraints indeed, on power (it was a diesel sub), physical size (had to
> fit in a torpedo hatch), heat dissipation (a bit), vulnerability to 100%
> humidity, salt, chlorine etc etc. Been there, Done that, Got the T-shirt.

So what did you do when you needed to build a system that was bigger than the
torpedo hatch? Re-design the submarine? You have physical limits that you just can't
exceed. On a rocket, or an airplane, you have even stricter limits.

Oh for the luxury of a diesel generator! We have to be able to operate on basic
battery power (and we share that bus with emergency lighting, etc.)

> I'm a Software Engineer who works mainly in Systems. Or maybe a Systems
> Engineer with a hardware bias. Regardless, in the initial Systems
> Engineering
> phase, when one gets all the HWCIs and CSCIs defined, it is only good
> professional practice to build in plenty of slack. If the requirement is
> to fit in a 21" hatch, you DON'T design something that's 20.99999" wide.
> If you can, make it 16", 18 at max. It'll probably grow.

Exactly. You build a system that has slack. Say, 15% slack. Which is exactly
why the INU design team didn't want to add checks unless they had to. Because
they were starting to eat into that slack.

> Similarly, if you require a minimum of 25 MFlops, make sure there's a
> growth path to at least 100. It may well be less expensive and less risky
> to build a chip factory to make a faster CPU than to lose a rocket, or a
> sub, due to software failure that could have been prevented.

What if your brand new CPU requires more power than your diesel generator
can generate?

What if your brand new CPU requires a technology that doesn't let you meet
your heat dissipation?

Doesn't sound like you had to make a lot of tradeoffs in your system.
Unfortunately, airborne systems, particular those that have to operate in
lower-power, zero-cooling situations (amazing how hot the air gets around
Mach 1!), don't have such luxuries.

> Usually such ridiculously extreme measures are not necessary. The
> Hardware guys bitch about the cost-per-CPU going through the roof. Heck,
> it could cost $10 million. But if it saves 2 years of Software effort,
> that's a net saving of $90 million.

What does maintenance costs have to do with this discussion?

> In this case, "theoretical expectations" for a really tight 5 MuSec loop
> should be less than 1 MuSec. Yes, I'm dreaming. OK, 3 MuSec, that's my
> final offer. For the vast majority of cases, if your engineering is
> closer to
> the edge than that, it'll cost big bucks to fix the over-runs you always
> get.

I've never had a project yet where we didn't routinely cut it that fine,
and we've yet to spend the big bucks. If you're used to developing systems
with those kind of constraints, you know how to make those decisions.
Occasionally, you make the wrong decision, as the Ariane designers discovered.
Welcome to engineering.

> Typical example: I had a big bun-fight with project management about a
> hefty data transfer rate required for a broadband sonar. They wanted to
> hand-code the lot in assembler, as the requirements were really, really
> tight. No time for any of this range-check crap, the data was always good.
> I eventually threw enough of a professional tantrum to wear down even a
> group of German Herr Professor Doktors, and we did it in Ada-83. If only
> as a first pass, to see what the rate really would be.
> The spec called for 160 MB/Sec. First attempt was 192 MB/Sec, and after
> some optimisation, we got over 250. After the hardware flaws were fixed
> (the ones the "unnecessary" range-bound checking detected) this was above 300.

And, if you had only got 20MB per second after all that, you would have
done...?

Certainly, if you just throw out range checking without knowing its cost,
you're an idiot. However, no one has shown that the Ariane team did this.
I guarantee you (and am willing to post object code to prove it) that
range checking is not always zero cost, and in the right circumstances can
cause you to bust your budget.

> Note also that by paying big $ for more capable hardware than strictly
> necessary, you can save bigger $ on the project.

Unfortunately, cost is not the only controlling variable.

Interesting that a $100K difference in per-unit cost in your systems is
negligible. No wonder people think military systems are too expensive!

Matthew Heaney

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to

In article <9610011...@psavax.pwfl.com>, "Marin David Condic,
407.796.8997, M/S 731-93" <cond...@PWFL.COM> wrote:

> It's not a case of saving a few CPU cycles so you can run Space
> Invaders in the background. Quite often (and in particular in
> *space* systems which are limited to rather antiquated
> processors) the decision is to a) remove the runtime checks from
> the compiled image and run with the possible risk of undetected
> constraint errors, etc. or b) give up and go home because there's
> no way you are going to squeeze the necessary logic into the box
> you've got with all the checks turned on.
>
> It's not as if we take these decisions lightly and are just being
> stingy with CPU cycles so we can save them up for our old age. We
> remove the checks typically because there's no other choice.

Funny you mention that, because I would have said take option b. My
attitude is that there is a state of the art today, and it's not cost
effective to try to push too far beyond that.

I'm not unsympathetic to your situation, as my own background is in
real-time (ground-based) systems. But when you try to push the technology
envelope beyond what is (easily) available today, the cost of your system
and the risk of failure shoots up.

To do what you wanted to do with your existing hardware meant you had to
turn off checks. Fair enough. But that decision very much increased your
risk that something bad would happen from which you wouldn't be able to
recover.

I heard those satellites cost $500 million dollars. Was turning off those
checks really worth the risk of losing that much money? To me you were
just gambling.

I would have said that, no, the risk is too great. Scale back the
requirements and let's do something less ambitious. If you really want to
do that, wait 18 months and Dr. Moore will give you hardware that's twice
as fast. But if you want to do it today, and you have turn the checks off,
well then, you're just rolling the dice.

The state of software art today is such that we can't deploy a provably
correct system, and we have to resort to run-time checks to catch logical
flaws. I accept this "limitation," and I accept that there are certain
kinds of systems we can't do today (because to do them would require
turning off checks).

Buyers of mission-critical software should think very carefully before they
commit any financial resources to implementing a software system that
requires checks be turned off. I'd say take your money instead to Las
Vegas: your odds for success are better there.

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mhe...@ni.net
(818) 985-1271

Matthew Heaney

unread,
Oct 2, 1996, 3:00:00 AM10/2/96
to

In article <3252B4...@lmtas.lmco.com>, Ken Garlington
<garlin...@lmtas.lmco.com> wrote:

>Interesting that a $100K difference in per-unit cost in your systems is
>negligible. No wonder people think military systems are too expensive!

I think he meant "negligable compared to the programming cost that would be
required to get the software to run on the cheaper hardware."

It's never cost effective to skimp on hardware if it means human
programmers have to write more complex software.

Richard A. O'Keefe

unread,
Oct 3, 1996, 3:00:00 AM10/3/96
to

"Wayne L. Beavers" <way...@beyond-software.com> writes:

>I have been reading this thread awhile and one topic that I have not
>seen mentioned is protecting the code area from damage.

I imagine that everyone else has taken this for granted.
UNIX compilers have been doing it for years, and so I believe have VMS ones.

>When I code in PL/I or any other reentrant language I always make sure
>that the executable code is executing from read-only storage.

(a) This is not something that the programmer should normally have to be
concerned with, it just happens.
(b) It cannot always be done. Run-time code generation is a practical and
important technique. (Making a page read-only after new code has been
written to it is a good idea, of course.)

>There is no way to put the data areas in read-only storage (obviously)

It may be obvious, but in important cases it isn't true.
UNIX (and I believe VMS) compilers have for years had the ability to put
_selected_ data in read-only storage. And of course it is perfectly
feasible in many operating systems (certainly UNIX and VMS) to write data
into a page and then ask the operating system to make that page read-only.

>but I can't think of any reason to put the executable code in writeable
>storage.

Run-time binary translation. Some approaches to relocation. How many
reasons do you want?

>I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable
>code from one system to another.

In a language where the last revision of the standard was 1976?
You have my deepest sympathy.

--
Australian citizen since 14 August 1996. *Now* I can vote the xxxs out!
Richard A. O'Keefe; http://www.cs.rmit.edu.au/%7Eok; RMIT Comp.Sci.

Alan Brain

unread,
Oct 3, 1996, 3:00:00 AM10/3/96
to

Ken Garlington wrote:

> So what did you do when you needed to build a system that was bigger than the
> torpedo hatch? Re-design the submarine?

Nope, we re-designed the system so it fit anyway. Actually, we designed
the thing in the first place so that the risk of it physically growing
too big and needing re-design was tolerable (ie contingency money was
allocated for doing this, if we couldn't accurately estimate the risk as
being small).

> Oh for the luxury of a diesel generator! We have to be able to operate on basic
> battery power (and we share that bus with emergency lighting, etc.)

Well ours had a generator connected to a hamster wheel with a piece of
cheese as backup ;-).... but seriously folks, yes we have a diesel. Why?
to charge the batteries. Use of the Diesel under many conditions - eg
when taking piccies in Vladivostok Harbour - would be unwise.



> Exactly. You build a system that has slack. Say, 15% slack. Which is exactly
> why the INU design team didn't want to add checks unless they had to. Because
> they were starting to eat into that slack.

I'd be very, very suspicious of a slack like "15%". This implies you
know to within 2 significant figures what the load is going to be. Which
in my experience is not the case. "About a Seventh" is more accurate, as
it implies more imprecision. And I'd be surprised if any Bungee-Jumper
would tolerate that small amount of safety margin using new equipment.
Then again, slack is supposed to be used up. It's for the unforeseen.
When you come across a problem during development, you shouldn't be
afraid of using up that slack, that's what it's there for! One is
reminded of the apocryphal story of the quartermaster at Pearl Harbour,
who refused to hand out ammunition as it could have been needed more
later.

> What if your brand new CPU requires more power than your diesel generator
> can generate?
> What if your brand new CPU requires a technology that doesn't let you meet
> your heat dissipation?

But it doesn't. When you did your initial systems engineering, you made
sure there was enough slack - OR had enough contingency money so that
you could get custom-built stuff.



> Doesn't sound like you had to make a lot of tradeoffs in your system.
> Unfortunately, airborne systems, particular those that have to operate in
> lower-power, zero-cooling situations (amazing how hot the air gets around
> Mach 1!), don't have such luxuries.

I see your zero-cooling situations, and I raise you H2, CO2, CO, Cl, H3O
conditions etc. The constraints on a sub are different, but the same in
scope. Until such time as you do work on a sub, or I do more than just a
little work on aerospace, we may have to leave it at that.



> > Usually such ridiculously extreme measures are not necessary. The
> > Hardware guys bitch about the cost-per-CPU going through the roof. Heck,
> > it could cost $10 million. But if it saves 2 years of Software effort,
> > that's a net saving of $90 million.
>
> What does maintenance costs have to do with this discussion?

Sorry I didn't make myself clear: I was talking development costs, not
maintenance.

> I've never had a project yet where we didn't routinely cut it that fine,
> and we've yet to spend the big bucks.

Then I guess either a) You're one heck of a better engineer than me (and
I freely admit the distinct possibility) or b) You've been really lucky
or c) You must tolerate a lot more failures than the organisations I've
worked for.

> If you're used to developing systems
> with those kind of constraints, you know how to make those decisions.
> Occasionally, you make the wrong decision, as the Ariane designers discovered.
> Welcome to engineering.

My work has only killed 2 people (Iraqi pilots - that particular system
worked as advertised in the Gulf). There might be as many as 5000 people
whose lives depend on my work at any time, more if War breaks out. I
guess we have a different view of "acceptable losses" here, and your
view may well be more correct. Why? Because such a conservative view as
my own may mean I just can't attempt some risky things. Things which
your team (sometimes at least) gets working, teherby saving more lives.
Yet I don't think so.

> And, if you had only got 20MB per second after all that, you would have
> done...?

20 MB? First, re-check all calculations. Examine hardware options. Then
(probably) set up a "get-well" program using 5-6 different tracks and
pick the best. Most probably though, we'd give up: it's not doable
within the budget. The difficult case is 150 MB. In this case, assembler
coding might just make the difference - I do get your point, BTW.



> Certainly, if you just throw out range checking without knowing its cost,
> you're an idiot. However, no one has shown that the Ariane team did this.
> I guarantee you (and am willing to post object code to prove it) that
> range checking is not always zero cost, and in the right circumstances can
> cause you to bust your budget.

Agree. There's always pathological cases where general rules don't
apply. Being fair, I didn't say "zero cost", I said "typically 5%
measured". In doing the initial Systems work, I'd usually budget for
10%, as I'm paranoid.



> Unfortunately, cost is not the only controlling variable.
>
> Interesting that a $100K difference in per-unit cost in your systems is
> negligible. No wonder people think military systems are too expensive!

You get what you pay for, IF you're lucky. My point though is that many
of the hacks, kludges etc in software are caused by insufficient
foresight in systems design. Case in point: RAN Collins class submarine.
Now many years late due to software problems. Last time I heard, they're
still trying to get that last 10% performance out of the 68020s on the
cards. Which were leading-edge when the systems work was done. Putting
in 68040s a few years ago would have meant the Software would have been
complete by now, as the hacks wouldn't have been necessary.

Marin David Condic, 407.796.8997, M/S 731-93

unread,
Oct 3, 1996, 3:00:00 AM10/3/96
to

Ken Garlington <garlin...@LMTAS.LMCO.COM> writes:
>So what did you do when you needed to build a system that was bigger than the
>torpedo hatch? Re-design the submarine? You have physical limits that you just
>can't
>exceed. On a rocket, or an airplane, you have even stricter limits.
>
>Oh for the luxury of a diesel generator! We have to be able to operate on basic
>battery power (and we share that bus with emergency lighting, etc.)
>
Just as you have physical limits and need to leave physical
margins, software has timing limits and needs to leave timing
margins. Both to accommodate the inevitable change and growth as
production units are fielded, but also as a *safety* factor. What
would happen to the Ariane 5 if that 80% utilization went to 105%
because the software hit an untested "corner case"? It's a good
reason to insist on leaving some margin.

You have emergency lighting? Lucky dog!

>What if your brand new CPU requires more power than your diesel generator
>can generate?
>
>What if your brand new CPU requires a technology that doesn't let you meet
>your heat dissipation?
>

>Doesn't sound like you had to make a lot of tradeoffs in your system.
>Unfortunately, airborne systems, particular those that have to operate in
>lower-power, zero-cooling situations (amazing how hot the air gets around
>Mach 1!), don't have such luxuries.
>

You get zero-cooling? Lucky dog! My box just keeps getting hotter
and hotter until it burns up. Hopefully *after* the mission is
over.

You get *air???!*! And never mind that Mach 1 stuff - my box is
strapped to the side of a blow-torch!

You're absolutely right about the engineering tradeoffs - In
flight systems especially since the biggest constraint is
typically weight & space. (Two commodities that are *much* easier
to compromise on when you get to sit on the ground - or sink under
the ocean) I'd gladly give my eye teeth to get double the CPU
speed I've got. Unfortunately, this is the best that can be done
within the current CPU technology and adding a second processor is
out of the question at this time: The box can't get heavier or
bigger without risking payload, power consumption and heat
disapation go up, etc. etc. etc. If it weren't for the megabucks
and the chance to meet chicks, I'd quit the engineering business
because of the headaches.

>And, if you had only got 20MB per second after all that, you would have
>done...?
>

Anyone can afford to be a purist right up to the point where they
have to tell their boss that they're at 105% utilization and that
the project they've invested millions on won't work. At that
point, you start looking at what you might inline to avoid
procedure call overhead, recode sections in assembler because you
can be smarter at it than the compiler, and yes, remove all those
extraneous runtime checks and prove out your code instead.

>Certainly, if you just throw out range checking without knowing its cost,
>you're an idiot. However, no one has shown that the Ariane team did this.
>I guarantee you (and am willing to post object code to prove it) that
>range checking is not always zero cost, and in the right circumstances can
>cause you to bust your budget.
>

Amen! Let's say you have 20 computations. Let's say that the
runtime checks cost 5uSec per computation. (Not unrealistic on many
processors where the average instruction uses 0.5 to 1.0uSec)
That's 100uSec. Suppose this code needs to run once every 1mSec.
Your runtime checks just consumed 10% of your CPU.

We did *exactly* this sort of analysis (both bench checking and
running sample code) and concluded that the runtime checks were
out or the project wouldn't work. And we're using one of the
*best* Ada compilers available for the 1750a - the EDS-Scicon
XD-Ada compiler.
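
In Ada the removal can at least be kept narrow: whether it is done with a
compiler switch or with pragma Suppress, you can leave checking on
everywhere else and strip it only from the frame-critical path that the
analysis covered. A hypothetical fragment (all names invented), just to
show the shape of it:

with Ada.Text_IO; use Ada.Text_IO;

procedure Check_Budget is

   type Counts is range -32_768 .. 32_767;
   type Gain_Table is array (1 .. 20) of Counts;

   Gains : constant Gain_Table := (others => 2);

   --  Frame-critical path: checks suppressed only here, justified by
   --  analysis/test showing the inputs are bounded upstream. At ~5 uSec
   --  of checking per computation and 20 computations per 1 mSec frame,
   --  the checks alone would cost 10% of the CPU.
   function Fast_Sum (Input : Counts) return Counts is
      pragma Suppress (Range_Check);
      pragma Suppress (Overflow_Check);
      Acc : Counts := 0;
   begin
      for I in Gain_Table'Range loop
         Acc := Acc + Input / Gains (I);
      end loop;
      return Acc;
   end Fast_Sum;

begin
   Put_Line (Counts'Image (Fast_Sum (100)));   --  checks still on out here
end Check_Budget;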

MDC

Marin David Condic, Senior Computer Engineer ATT: 561.796.8997
M/S 731-96 Technet: 796.8997
Pratt & Whitney, GESP Fax: 561.796.4669
P.O. Box 109600 Internet: COND...@PWFL.COM
West Palm Beach, FL 33410-9600 Internet: CON...@FLINET.COM
===============================================================================

Glendower: "I can call spirits from the vasty deep."
Hotspur: "Why so can I, or so can any man; but will they come when
you do call for them?"

-- Shakespeare, "Henry IV"
===============================================================================

Marin David Condic, 407.796.8997, M/S 731-93

unread,
Oct 3, 1996, 3:00:00 AM10/3/96
to

Ken Garlington <garlin...@LMTAS.LMCO.COM> writes:
>> Brain's law:
>> "Software Bugs and Hardware Faults are no excuse for the Program not to
>> work".
>
>Too bad that law can't be enforced :)
>
Yup! Hardware faults - such as a CPU out to lunch - can pretty
much be impossible to fix with the software that's running on it.
As for software faults, isn't it a little like being in the
"Physcian, heel thyself!" mode? I am insane - let me diagnose and
cure my own insanity...But being insane, can I know that my
diagnosis and/or cure isn't also insane? A bit of a paradox, no?

Yes, yes, yes. Exception handlers and so on can do a remarkable
job of catching problems and fixing them. But out of the set of
all possible software bugs, there is a non-empty set containing
software bugs which mean your program has gone insane.

You can only accommodate the bugs and/or faults which you can
think of. What about the few hundred bugs/faults you *didn't*
think of? Bet your donkey that they're going to happen someday,
somewhere and the only way you're going to learn about them is by
having them rear their ugly heads. Ask the engineers who designed
The Tacoma Bridge or the o-rings on the space shuttle about it.

Marin David Condic, 407.796.8997, M/S 731-93

unread,
Oct 3, 1996, 3:00:00 AM10/3/96
to

Ken Garlington <garlin...@LMTAS.LMCO.COM> writes:

>Wayne L. Beavers wrote:
>>
>> I have been reading this thread awhile and one topic that I have not seen
>> mentioned is protecting the code area from damage. When I code in PL/I or
>> any other reentrant language I always make sure that the executable code
>> is executing from read-only storage. There is no way to put the data areas
>> in read-only storage (obviously) but I can't think of any reason to put
>> the executable code in writeable storage.
>

>That's actually a pretty common rule of thumb for safety-critical systems.
>Unfortunately, read-only memory isn't exactly read-only. For example,
>hardware errors can cause a random change in the memory. So, it's not a
>perfect fix.
>
Actually there is a reason for sucking the code out of EEPROM and
into RAM. EEPROMs (as I understand what the hardware dweebes tell
me) are unusually susceptible to single event upsets (SEUs) if you
have lots of gamma radiation hanging around in the neighborhood.
Whereas RAMs are easier to make Rad-Hard and survive this stuff
better.

This poses problems for us software geeks to solve when creating
the bootstrap, but there are apparently good engineering reasons
for doing so. It would be nice if we could simply put an S.E.P.
Field (S.omebody E.lses P.roblem) around the hardware issues, but
once in a while the software guys have to bail out the hardware
guys because of physics.

Robert S. White

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

In article <mheaney-ya0231800...@news.ni.net>, mhe...@ni.net
says...

>It's never cost effective to skimp on hardware if it means human
>programmers have to write more complex software.

Not if the ratio is tilted very heavily towards recurring cost versus
Non-Recurring Engineering (NRE). How about 12 staff-months versus $300 extra
hardware cost on 60,000 units?

___________________________________________________________________________
Robert S. White -- an embedded systems software engineer
Whi...@CRPL.Cedar-Rapids.lib.IA.US -- It's long, but I pay for it!


@@ robin

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

Lawrence Foard <ent...@vwis.com> writes:

>Ronald Kunne wrote:

>> Actually, this was the case here: the code was taken from an Ariane 4
>> code where it was physically impossible that the index would go out
>> of range: a test would have been a waste of time.

---A test for overflow in a system that aborts if unexpected overflow
occurs, is never a waste of time.

Recall Murphy's Law: "If anything can go wrong, it will."
Then there's Robert's Law: "Even if it can't go wrong, it will."

>> Unfortunately this was no longer the case in the Ariane 5.

>Actually it would still present a danger on Ariane 4. If the sensor
>which apparently was no longer needed during flight became defective,
>then you could get a value out of range.

---Good point Lawrence.

@@ robin

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

jo...@assen.demon.co.uk (John McCabe) writes:

>Just a point for your information. From clari.tw.space:

> "An inquiry board investigating the explosion concluded in
>July that the failure was caused by software design errors in a
>guidance system."

>Note software DESIGN errors - not programming errors.

>Best Regards
>John McCabe <jo...@assen.demon.co.uk>

---If you read the Report, you'll see that that's not the case.
This is what the report says:


"* The internal SRI software exception was caused during execution of a
data conversion from 64-bit floating point to 16-bit signed integer
value. The floating point number which was converted had a value
greater than what could be represented by a 16-bit signed integer.
This resulted in an Operand Error. The data conversion instructions
(in Ada code) were not protected from causing an Operand Error,
although other conversions of comparable variables in the same place
in the code were protected.

"In the failure scenario, the primary technical causes are the Operand Error
when converting the horizontal bias variable BH, and the lack of protection
of this conversion which caused the SRI computer to stop."

---As you can see, it's clearly a programming error. It's a failure
to check for overflow on converting a double precision value to
a 16-bit integer.
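
For readers following along without the report: "protected" here just
means guarding the conversion so that an out-of-range value cannot raise
the Operand Error. An illustrative Ada sketch only -- the names are
invented and this is not the actual SRI code:

with Ada.Text_IO; use Ada.Text_IO;

procedure Protected_Conversion is

   type Int_16 is range -32_768 .. 32_767;

   --  The guarded pattern: test the 64-bit value against the target
   --  range first, and saturate (or flag the failure) instead of
   --  letting the conversion trap and stop the computer.
   function To_Int_16 (X : Long_Float) return Int_16 is
   begin
      if X > Long_Float (Int_16'Last) then
         return Int_16'Last;
      elsif X < Long_Float (Int_16'First) then
         return Int_16'First;
      else
         return Int_16 (X);
      end if;
   end To_Int_16;

   Horizontal_Bias : constant Long_Float := 123_456.0;  --  larger than Ariane 4 ever saw

begin
   Put_Line (Int_16'Image (To_Int_16 (Horizontal_Bias)));  --  saturates, no exception
end Protected_Conversion;

The unprotected form is simply "Int_16 (Horizontal_Bias)" with no guard,
which is the form the BH conversion had. Whether saturating (rather than
stopping) would have been the right policy for BH is a systems question,
not a language one.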

Michel OLAGNON

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

But if you read a bit further on, it is stated that

The reason why three conversions, including the horizontal bias variable one,
were not protected, is that it was decided that they were physically bounded
or had a wide safety margin (...) The decision was a joint one of the project
partners at various contractual levels.

Deciding at various contractual levels is not what one usually means by
``programming''. It looks closer to ``design'', IMHO. But, of course, anyone
can give any word any meaning.
And it might well be that the action taken in case of a protected conversion,
and an exception, would also have been to stop the SRI computer, because such a high
horizontal bias would have meant that it was broken....

Michel

--
| Michel OLAGNON email : Michel....@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|


Steve Bell

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

Michael Dworetsky wrote:
>
> >Just a point for your information. From clari.tw.space:
> >
> > "An inquiry board investigating the explosion concluded in
> >July that the failure was caused by software design errors in a
> >guidance system."
> >
> >Note software DESIGN errors - not programming errors.
> >
>
> Indeed, the problems were in the specifications given to the programmers,
> not in the coding activity itself. They wrote exactly what they were
> asked to write, as far as I could see from reading the report summary.
>
> The problem was caused by using software developed for Ariane 4's flight
> characteristics, which were different from those of Ariane 5. When the
> launch vehicle exceeded the boundary parameters of the Ariane-4 software,
> it sent an error message and, as specified by the remit given to
> programmers, a critical guidance system shut down in mid-flight. Ka-boom.
>

I work for an aerospace company, and we received a fairly detailed accounting of what
went wrong with the Ariane 5. Launch vehicles, while they are sitting on the launch
pad, run a guidance program that updates their position and velocity in reference to
a coordinate frame whose origin is at the center of the earth (usually called an
Earth-Centered-Inertial (ECI) frame). This program is usually started up from 1 to 3-4
hours before launch and is allowed to run all the way until liftoff, so that the
rocket will know where it's at and how fast it's going at liftoff. Although called
"ground software," (because it runs while the rocket is on the ground), it resides
inside the rocket's guidance computer(s), and for the Titan family of launch vehicles,
the code is exited at t=0 (liftoff). This code is designed with knowing that the
rocket is rotating on the surface of the earth, and the algorithms expect only very
mild accelerations (as compared to when the rocket hauls ass off the pad at liftoff).
Well, the French do things a little differently (but probably now they don't). The
Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs
past liftoff. They do (did) this in case there are any unanticipated holds in the
countdown right close to liftoff. In this way, this position and velocity updating
code would *not* have to be reset if they could get off the ground within just a few
seconds of nominal. Well, it appears that the Ariane 5 really hauls ass off the pad,
because at about 30 secs, it was pulling some accelerations that caused floating point
overflows in the still functioning ground software. The actual flight software (which
was also running, naturally) was computing the positions and velocities that were
being used to actually fly the rocket, and it was doing just fine - no overflow errors
there because it was designed to expect high accelerations. There are two flight
computers on the Ariane 5 - a primary and a backup - and each was designed to shut
down if an error such as a floating point overflow occurred, thinking that the other
one would take over. Both computers were running the ground software, and both
experienced the floating point errors. Actually, the primary went belly-up first, and
then the backup within a fraction of a second later. With no functioning guidance
computer on board, well, ka-boom as you say.

Apparently the Ariane 4 gets off the ground with smaller accelerations than the 5, and
this never happened with a 4. You might take note that this would never happen with a
Titan because we don't execute this ground software after liftoff. Even if we did, we
would have caught the floating point overflows way before launch because we run all
code in what's called "Real-Time Simulations" where actual flight hardware and software
are subjected to any and all known physical conditions. This was another finding of
the investigation board - apparently the French don't do enough of this type of
testing because it's real expensive. Oh well, they probably do now!

--
Clear skies,
Steve Bell
sb...@delphi.com
http://people.delphi.com/sb635 - Astrophoto page

Ken Garlington

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

Matthew Heaney wrote:
>
> It's never cost effective to skimp on hardware if it means human
> programmers have to write more complex software.

Never say never!

Take an example of building, say, 1,000 units of some widget containing 1,500
source lines of code. For ease of calculation, assume $100/workerhour, 150
w-hours/w-month.

1. Buy a CPU at $50/unit which will do the job, but will cause the software
development team to spend 10 w-months to complete the task, and will cause
the post-deployment cost to be 2x the development cost.

10 w-months to complete the original development is $100 x 150 x 10 = $150,000.
Maintenance is $300,000. Total software cost per unit (amortized over several
years, possibly, for the maintenance): $450/unit.

2. Buy a CPU at $300/unit which will do the job, and because it's so modern,
the software development team only needs 5 w-months to complete the task, and
the post-deployment cost is only 1x the development cost. In other words, the
software development time is cut in half (the standard promise for such
improvements). So, software cost: $225/unit.

Assuming I did my math right, I'd be buying some cheap hardware right about now...

Ken Garlington

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

Alan Brain wrote:
>
> Ken Garlington wrote:
>
> > So what did you do when you needed to build a system that was bigger than the
> > torpedo hatch? Re-design the submarine?
>
> Nope, we re-designed the system so it fit anyway.

Tsk, tsk! You violated your own design constraint of "always provide enough
margin for growth." Just think how much money you would have saved if you had
built it bigger to begin with!

> Actually, we designed
> the thing in the first place so that the risk of it physically growing
> too big and needing re-design was tolerable (ie contingency money was
> allocated for doing this, if we couldn't accurately estimate the risk as
> being small).

I'm sure the Arianespace folks had the same contingency funding. In fact, they're
spending it right now. :)

>
> > Oh for the luxury of a diesel generator! We have to be able to operate on basic
> > battery power (and we share that bus with emergency lighting, etc.)
>
> Well ours had a generator connected to a hamster wheel with a piece of
> cheese as backup ;-).... but seriously folks, yes we have a diesel. Why?
> to charge the batteries.

Batteries, plural? Wow!

> I'd be very, very suspicious of a slack like "15%". This implies you
> know to within 2 significant figures what the load is going to be. Which
> in my experience is not the case. "About a Seventh" is more accurate, as
> it implies more imprecision. And I'd be surprised if any Bungee-Jumper
> would tolerate that small amount of safety margin using new equipment.
> Then again, slack is supposed to be used up. It's for the unforeseen.
> When you come across a problem during development, you shouldn't be
> afraid of using up that slack, that's what it's there for!

Actually, no. For most military programs, slack is for a combination of
growth _after_ the initial development, or for unforeseen variations in
the production system (e.g., a processor that's a little slower than spec.)
And, 15% is a common number for such slack.

I think you're confusing "slack" with "management reserve," which is usually
an number set by the development organization and used up (if needed) during
development. The 15% number is usually imposed by a prime on a subcontractor
for the reasons described above.

> > What if your brand new CPU requires more power than your diesel generator
> > can generate?
> > What if your brand new CPU requires a technology that doesn't let you meet
> > your heat dissipation?
>
> But it doesn't. When you did your initial systems engineering, you made
> sure there was enough slack - OR had enough contingency money so that
> you could get custom-built stuff.

How much money is required to violate the laws of physics? _That's_ the
kind of limitations we're talking about when you get into power, cooling,
heat dissipation, etc.

> I see your zero-cooling situations, and I raise you H2, CO2, CO, Cl, H3O
> conditions etc. The constraints on a sub are different, but the same in
> scope. Until such time as you do work on a sub, or I do more than just a
> little work on aerospace, we may have to leave it at that.

But we _already_ have these same restrictions, since we have to operate in
Naval environments. We also have _extra_ requirements.

Considering that the topic of this thread is an aerospace system, I think
it's not enough to "leave it at that."

>
> > > Usually such ridiculously extreme measures are not necessary. The
> > > Hardware guys bitch about the cost-per-CPU going through the roof.
> > > Heck, it could cost $10 million. But if it saves 2 years of Software
> > > effort, that's a net saving of $90 million.
> >
> > What does maintenance costs have to do with this discussion?
>
> Sorry I didn't make myself clear: I was talking development costs, not
> maintenance.

Then you're not talking about inertial nav systems. On most of the projects
I've seen, the total software development time is two years or less. You're
not going to save 2 years of software effort for a new system!

> > If you're used to developing systems
> > with those kind of constraints, you know how to make those decisions.
> > Occasionally, you make the wrong decision, as the Ariane designers discovered.
> > Welcome to engineering.
>
> My work has only killed 2 people (Iraqi pilots - that particular system
> worked as advertised in the Gulf). There might be as many as 5000 people
> whose lives depend on my work at any time, more if War breaks out. I
> guess we have a different view of "acceptable losses" here, and your
> view may well be more correct.

You're missing the point. It's not a question as to whether it's OK for the
system to fail. It's a question of humans having to make decisions that
don't include "well, if we throw enough money at it, we'll get everything we
want." You cannot optimize software development time and ignore all other
factors! In some cases, you have to compromise software development/maintenance
efficiencies to meet other requirements. Sometimes, you make the wrong
decision. Anyone who says they've always made the right call is a lawyer, not
an engineer.

> Why? Because such a conservative view as
> my own may mean I just can't attempt some risky things. Things which
> your team (sometimes at least) gets working, teherby saving more lives.

However, if you build a system with the latest and greatest CPU, thereby
having the maximum amount of horsepower to permit the software engineers
to avoid turning off certain checks, etc., you _have_ attempted a risky
thing. The latest hardware technology is the least used.

> Yet I don't think so.
>
> > And, if you had only got 20MB per second after all that, you would have
> > done...?
>
> 20 MB? First, re-check all calculations. Examine hardware options. Then
> (probably) set up a "get-well" program using 5-6 different tracks and
> pick the best. Most probably though, we'd give up: it's not doable
> within the budget.

That's the difference. We would not go to our management and say, "The
only solutions we have require us to make compromises in our software
approach, therefore it can't be done. Take your multi-billion project
and go home." We'd work with the other engineering disciplines to come
up with the best compromise. It's the difference, in my mind, between a
computer scientist and a software engineer. The software engineer is paid
to find a way to make it work -- even if (horrors) he has to write it in
assembly, or use Unchecked_Conversion, or whatever.


> The difficult case is 150 MB. In this case, assembler
> coding might just make the difference - I do get your point, BTW.
>
> > Certainly, if you just throw out range checking without knowing its cost,
> > you're an idiot. However, no one has shown that the Ariane team did this.
> > I guarantee you (and am willing to post object code to prove it) that
> > range checking is not always zero cost, and in the right circumstances can
> > cause you to bust your budget.
>
> Agree. There are always pathological cases where general rules don't
> apply. Being fair, I didn't say "zero cost", I said "typically 5%
> measured". In doing the initial Systems work, I'd usually budget for
> 10%, as I'm paranoid.

I've seen checks in just the wrong place that cause differences of 30% or
more in a high-rate process. It's just not that trivial.

> You get what you pay for, IF you're lucky. My point though is that many
> of the hacks, kludges etc in software are caused by insufficient
> foresight in systems design.

And I wouldn't argue that. However, it's a _big_ leap to say ALL hacks
are caused by such problems. Also, having gone through the system design
process a few times, I've never had "sufficient foresight." There's always
been at least one choice I made then that I would have made differently today.
(Why didn't I see the obvious answer in 1985: HTML for my documentation! :)

That's why reuse is always so tricky in safety-critical systems. It's very
easy to make reasonable decisions then that don't make sense now. That's
why I laugh at people who say, "reused code is safer; you don't have to
test it once you get it working once!"

> Case in point: RAN Collins class submarine.
> Now many years late due to software problems. Last time I heard, they're
> still trying to get that last 10% performance out of the 68020s on the
> cards. Which were leading-edge when the systems work was done. Putting
> in 68040s a few years ago would have meant the Software would have been
> complete by now, as the hacks wouldn't have been necessary.

68040s? I didn't think you could get mil-screened 68040s anymore. They're
already obsolete.

Not easy to make those foresighted decisions, is it? :)

Ken Garlington

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

Matthew Heaney wrote:
>
> Buyers of mission-critical software should think very carefully before they
> commit any financial resources to implementing a software system that
> requires checks be turned off. I'd say take your money instead to Las
> Vegas: your odds for success are better there.

Better not drive or fly there: more than likely, the software systems running
in your car, plane, etc. are written in a language without any built-in
support for checks.

Checks are not a magic wand. They do not inherently make systems safer. What
matters is how you use the checks. If your ABS software fails in the middle
of winter, printing out a stack dump is not going to make you much safer!

Joseph C Williams

unread,
Oct 4, 1996, 3:00:00 AM10/4/96
to

Why didn't they run the code against an Ariane 5 simulator to
reverify the Ariane 4 software that was reused? A good real-time
engineering simulation would have caught the problem.

Alan Brain

unread,
Oct 5, 1996, 3:00:00 AM10/5/96
to

Robert S. White wrote:
>
> In article <mheaney-ya0231800...@news.ni.net>, mhe...@ni.net
> says...
>
> >It's never cost effective to skimp on hardware if it means human
> >programmers have to write more complex software.
>
> Not if the ratio is tilted very heavily towards recurring cost versus
> Non-Recurring Engineering (NRE). How about 12 staff-months versus $300 extra
> hardware cost on 60,000 units?

$300 extra on 60,000 units. That's $18 Million, right?

vs

12 Staff-months. Now if your staff is 1, then that's maybe $200,000 for
a single top-notch professional. If your staff is 200, each at 100,000 cost (i.e.
average wage is about 50K/year), then that's 20 million. But say you
only have the one guy. And say it adds 50% to the risk of failure. With
consequent and liquidated damages of 100 Million. Then what it's really
costing is 50 million, 200 thousand.
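
(To spell out the arithmetic being assumed there: one engineer at $200,000, plus
a 50% chance of incurring the $100 million in consequent and liquidated damages,
gives an expected cost of

   200,000 + 0.5 * 100,000,000 = 50,200,000

i.e. the 50.2 million figure, against 60,000 * $300 = $18 million for the extra
hardware.)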

Feel free to make whatever strawman case you want. The above figures are
based on 2 different projects (actually the liquidated damages one
involved 6 people, rather than 1, and an estimated 70% increased chance
of failure, but I digress).

Summary: In the real world, and with the current state-of-the-art, I
agree with the original statement as an excellent general rule.

Robert Dewar

unread,
Oct 5, 1996, 3:00:00 AM10/5/96
to

Robert White said

">It's never cost effective to skimp on hardware if it means human
>programmers have to write more complex software.

Not if the ratio is tilted very heavily towards recurring cost versus
Non-Recurring Engineering (NRE). How about 12 staff-months versus $300 extra
hardware cost on 60,000 units?"

Of course this is true at some level, but the critical thing is that a proper
cost comparison here must take into account:

a) full life cycle costs of the software, not just development costs
b) time-to-market delays caused by more complex software
c) decreased quality and reliability caused by more complex software

There are certainly cases where careful consideration of these three factors
still results in a decision to use less hardware and more complex software,
but I think we have all seen cases where such decisions were made, and in
retrospect turned out to be huge mistakes.


Robert Dewar

unread,
Oct 5, 1996, 3:00:00 AM10/5/96
to

Matthew said

"> Buyers of mission-critical software should think very carefully before they
> commit any financial resources to implementing a software system that
> requires checks be turned off. I'd say take your money instead to Las
> Vegas: your odds for success are better there."

To the extent that checks are used for catching hardware failures this might
be true, but in practice the runtime checks of Ada are not a well tuned
tool for this purpose, although I have seen programs that work hard to take
more advantage of such checks. For example:

type My_Boolean is new Boolean;
for My_Boolean use (2#0101#, 2#1010#);

so that 1 bit errors cannot give valid Boolean values (check and see if your
compiler supports this, it is not required to do so!)
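
A hedged sketch of how such a type might be exercised, assuming an Ada 95
compiler that accepts both the representation clause and the 'Valid attribute
(the unchecked conversion below merely simulates a one-bit upset; all the names
are illustrative):

   with Interfaces;
   with Ada.Text_IO;
   with Ada.Unchecked_Conversion;
   procedure Valid_Demo is
      type My_Boolean is new Boolean;
      for My_Boolean use (2#0101#, 2#1010#);
      for My_Boolean'Size use 8;

      function To_Flag is
         new Ada.Unchecked_Conversion (Interfaces.Unsigned_8, My_Boolean);

      Corrupted : constant Interfaces.Unsigned_8 := 2#0111#;  -- one bit flipped from 2#0101#
      Flag      : constant My_Boolean := To_Flag (Corrupted);
   begin
      if Flag'Valid then
         Ada.Text_IO.Put_Line ("pattern accepted");
      else
         Ada.Text_IO.Put_Line ("bit error detected");  -- expected path
      end if;
   end Valid_Demo;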

However, to the extent that checks are used to catch programming errors,
I think that I would prefer that a safety critical system NOT depend on
such devices. A programming error in a checks-on program may indeed result
in a constraint error, but it may also cause the plane to dive into the sea
without raising a constraint error.

I find the second outcome here unacceptable, so the methodology must simply
prevent such errors completely. Indeed if you look at safety critical
subsets for Ada they often omit exceptions precisely because of this
consideration. After all exceptions make the language and compiler more
complex, and that itself may introduce concerns at the safety critical
level.

Note also that exceptions are a double edged sword. An exception that is
not handled properly can be much worse than no exception at all. If you
have a section of code doing non-critical calculations (e.g. how much
time to wait before showing the movie in the main cabin), it really does
not matter too much if that calculation overflows and shows the movie a
bit early, but if it causes an unhandled exception that wipes out the
entire passenger control system, and turns out all the reading lights
etc. that can be much worse. Even in a safety critical system, there will
be calculations that are relatively unimportant.

For example, a low priority task may cause an overflow. If ignored, an
unimportant result is simply wrong. If not ignored, the handling of the
exception may cause that low priority task to overrun its CPU slice, and
cause chaos elsewhere.

As Ken says, checks are not a magic wand. They are a powerful tool, but
like any tool, subject to abuse. A chain saw with a kickback guard on the
end is definitely a safer tool to use, especially for an amateur, than
one without (something I appreciate while clearing paths through the woods
at my Vermont house), but it does not mean that now the tool is a completely
safe one, and indeed a real expert with a chain saw will often feel that it
is safer to operate without the guard, because then the behavior of the
chainsaw is simpler and more predictable.


Robert S. White

unread,
Oct 6, 1996, 3:00:00 AM10/6/96
to

In article <3256ED...@dynamite.com.au>, aeb...@dynamite.com.au says...

>
>12 Staff-months. Now if your staff is 1, then that's maybe $200,000 for
>a single top-notch profi. If your staff is 200, each at 100,000 cost (ie
>average wage is about 50K/year), then that's 20 million.

Number of staff * amount of time = staff months
(with a dash of reality for reasonable parallel tasks)

The type of strawman that I had in mind could be 1 person for a year, two
persons for six months, to a limit of 4 persons for 3 months. And watch out
for the mythical man-machine month!

> But say you
>only have the one guy. And say it adds 50% to the risk of failure. With
>consequent and liquidated damages of 100 Million. Then that's 50
>million, 200 thousand it's really costing.

Projects these days also have a "Risk Management Plan" per SEI CMM
recommendations. That 50% added to the risk of failure has to be assigned an
estimated cost and factored into the decision.


>Feel free to make whatever strawman case you want. The above figures are
>based on 2 different projects (actually the liquidated damages one
>involved 6 people, rather than 1, and an estimated 70% increased chance
>of failure, but I digress).

I've seen a lot of successes. Failures most often can be attributed to poor
judgement by incompetent personnel. That can be tough to manage when the
managers don't want to hear bad news or risk projections. Especially when they
set up a project and move on before it is done.

>
>Summary: In the real world, and with the current state-of-the-art, I
>agree with the original statement as an excellent general rule.

I beg to disagree in the case of higher volume markets. I do agree very much
for lower volumes or when the type of development task is new to the engineers
and managers. You must have a good understanding of the problem and the
solution domain to do proper cost tradeoffs that involve significant risk.

Keith Thompson

unread,
Oct 6, 1996, 3:00:00 AM10/6/96
to

In <dewar.844518011@schonberg> de...@schonberg.cs.nyu.edu (Robert Dewar) writes:
> To the extent that checks are used for catching hardware failures this might
> be true, but in practice the runtime checks of Ada are not a well tuned
> tool for this purpose, although I have seen programs that work hard to take
> more advantage of such checks. For example:
>
> type My_Boolean is new Boolean;
> for My_Boolean use (2#0101#, 2#1010#);
>
> so that 1 bit errors cannot give valid Boolean values (check and see if your
> compiler supports this, it is not required to do so!)

But then there's still no guarantee that an invalid Boolean value will
be detected. The code generated for an if statement, for example, is
unlikely to check its Boolean condition for validity.

Of course, there will probably be a runtime call to a routine that
converts from My_Boolean to Boolean, and this routine will *probably*
do something sensible (raise Program_Error) for an invalid argument.

Anyone doing something this tricky is presumably already examining the generated
code to make sure there are no surprises.

--
Keith Thompson (The_Other_Keith) k...@thomsoft.com <*>
TeleSoft^H^H^H^H^H^H^H^H Alsys^H^H^H^H^H Thomson Software Products
10251 Vista Sorrento Parkway, Suite 300, San Diego, CA, USA, 92121-2706
FIJAGDWOL

Robert S. White

unread,
Oct 6, 1996, 3:00:00 AM10/6/96
to

In article <dewar.844517570@schonberg>, de...@schonberg.cs.nyu.edu says...

>There are certainly cases where careful consideration of these three factors
>still results in a decision to use less hardware and more complex software,
>but I think we have all seen cases where such decisions were made, and in
>retrospect turned out to be huge mistakes.

In business when making hard decisions about embedded systems products,
such studies are almost always made. One factor now causing a lot of study
and implementation of "complex software" is the effort to reduce procedure call
overhead and virtual dispatching for object oriented high level languages
that might be used for embedded products. The extra throughput and memory
required does end up using more memory chips and faster (more power
hungry) processors. This is a major problem when power consumption, size and
cost requirements are taken into consideration. Engineering has to look at
the entire picture.

We want to use the latest software technology that results in the cleanest
most easy to maintain design. Sometimes it takes a while to hone the tools
that use this new technology till they are ready for prime time in mission
critical software that has to operate in an environment with a lot of other
constraints. Ada 83 in 1984 had a lot of problems in implementations that
were mostly solved by 1991. Ada 95 (with advantage taken of its new object
oriented features) and Java bytecode virtual machines also need a significant
amount of effort expended till they are ready for these more constrained
embedded products. Not to say that it won't happen, just that they often
don't pass the rigorous cost/feature analysis tradeoffs at this date for
immediate product implementation.

And we NEVER want to make a mistake for flight critical software :-<
or have a critical task not able to run to completion within its deadlines.
We need that processor throughput reserve to have a safety margin for rate
monotonic tasks.

I agreed with most everything Ken Garlington and other Lockheed Martin
engineers have posted in this thread on this same subject. Their statements
ring with my industry experience for the last 21 years.

Wayne Hayes

unread,
Oct 6, 1996, 3:00:00 AM10/6/96
to

In article <32551A...@gsde.hso.link.com>,

Joseph C Williams <u6...@gsde.hso.link.com> wrote:
>Why didn't they run the code against an Ariane 5 simulator to
>reverify the Ariane 4 software what was reused?

Money. (The more cynical among us may say this translates to "stupidity".)

--
"Unix is simple and coherent, but it takes || Wayne Hayes, wa...@cs.utoronto.ca
a genius (or at any rate, a programmer) to || Astrophysics & Computer Science
appreciate its simplicity." -Dennis Ritchie|| http://www.cs.utoronto.ca/~wayne

Ken Garlington

unread,
Oct 7, 1996, 3:00:00 AM10/7/96
to

Steve Bell wrote:

> Well, the French do things a little differently (but probably now they don't). The
> Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs
> past liftoff. They do (did) this in case there are any unanticipated holds in the
> countdown right close to liftoff. In this way, this position and velocity updating
> code would *not* have to be reset if they could get off the ground within just a few
> seconds of nominal.

But why 40 seconds? Why not 1 second (or one millisecond, for that matter)?

> You might take note that this would never happen with a
> Titan because we don't execute this ground software after liftoff. Even if we did, we
> would have caught the floating point overflows way before launch because we run all
> code in what's called "Real-Time Simulations" where actual flight harware and software
> are subjected to any and all known physical conditions. This was another finding of
> the investigation board - apparently the French don't do enough of this type of
> testing because it's real expensive.

Going way back into my history, I believe this is also true for Atlas.

> --
> Clear skies,
> Steve Bell
> sb...@delphi.com
> http://people.delphi.com/sb635 - Astrophoto page

--

Alan Brain

unread,
Oct 8, 1996, 3:00:00 AM10/8/96
to

Robert Dewar wrote:

> However, to the extent that checks are used to catch programming errors,
> I think that I would prefer that a safety critical system NOT depend on
> such devices. A programming error in a checks-on program may indeed result
> in a constraint error, but it may also cause the plane to dive into the sea
> without raising a constraint error.

A good point. Unless you use multiple exception handlers at all levels
of the program, it might be better to leave the bug to fester. I guess
you've really got to know what you're doing. Fortunately the concept of
making sure exceptions are trapped everywhere is simple.



> I find the second outcome here unacceptable, so the methodology must simply
> prevent such errors completely. Indeed if you look at safety critical
> subsets for Ada they often omit exceptions precisely because of this
> consideration. After all exceptions make the language and compiler more
> complex, and that itself may introduce concerns at the safety critical
> level.

I don't believe in bug-free code. Even if it is bug-free, there's always
the minuscule chance of hardware and soft failures. Now I'm not saying
that error-trapping will save you every time from every failure: but a
"layered defence" of 40 layers, each of which only catches 50% of
errors, is one heck of a lot more effective than one layer that works
99.99999% of the time.
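
Taking those illustrative figures at face value (and assuming the layers fail
independently, which is the strong assumption here):

   one layer at 99.99999%  :  miss probability = 1 - 0.9999999 = 1.0E-7
   40 layers at 50% each   :  miss probability = 0.5 ** 40     = approx. 9.1E-13

i.e. about five orders of magnitude better - provided the failures really are
independent.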

> Note also that exceptions are a double edged sword. An exception that is
> not handled properly can be much worse than no exception at all.

Concur.

> For example, a low priority task may cause an overflow. If ignored, an
> unimportant result is simply wrong. if not ignored, the handoling of the
> exception mayh cause that low priority task to overrun its CPU slice, and
> cause chaos elsewhere.

And not handling it may cause the same thing, or even violate memory
constraints etc. No Handling means no chance of tolerating such an
error: The ability to handle PLUS someone who knows how to use
exceptions means there is such a chance.



> As Ken says, checks are not a magic wand. They are a powerful tool, but
> like any tool, subject to abuse.

Concur. My conclusion though is that checks are better than no-checks;
even if the person using them is relatively incompetent. Trouble is, I
only have anecdotal evidence of this, not a proper study.

@@ robin

unread,
Oct 9, 1996, 3:00:00 AM10/9/96
to

mola...@ifremer.fr (Michel OLAGNON) writes:

>In article <532k32$r...@goanna.cs.rmit.edu.au>, r...@goanna.cs.rmit.edu.au (@@ robin) writes:
>> jo...@assen.demon.co.uk (John McCabe) writes:
>>

>> >Just a point for your information. From clari.tw.space:
>>
>> > "An inquiry board investigating the explosion concluded in
>> >July that the failure was caused by software design errors in a
>> >guidance system."
>>
>> >Note software DESIGN errors - not programming errors.
>>

>| Michel OLAGNON email : Michel....@ifremer.fr|

But if you read further on ....

"However, three of the variables were left unprotected. No reference to
justification of this decision was found directly in the source code. Given
the large amount of documentation associated with any industrial
application, the assumption, although agreed, was essentially obscured,
though not deliberately, from any external review."

.... you'll see that there was no documentation in the code to
explain why these particular 3 (dangerous) conversions were
left unprotected. There is the implication that one or more
of them might have been overlooked . . . Don't place
too much reliance on the conclusion of the report, when
the detail is right there in the body of the report.

@@ robin

unread,
Oct 9, 1996, 3:00:00 AM10/9/96
to

Steve Bell <sb...@delphi.com> writes:

>Michael Dworetsky wrote:
>>
>> >Just a point for your information. From clari.tw.space:
>> >
>> > "An inquiry board investigating the explosion concluded in
>> >July that the failure was caused by software design errors in a
>> >guidance system."
>> >
>> >Note software DESIGN errors - not programming errors.
>> >
>>

>> Indeed, the problems were in the specifications given to the programmers,
>> not in the coding activity itself. They wrote exactly what they were
>> asked to write, as far as I could see from reading the report summary.
>>
>> The problem was caused by using software developed for Ariane 4's flight
>> characteristics, which were different from those of Ariane 5. When the
>> launch vehicle exceeded the boundary parameters of the Ariane-4 software,
>> >it sent an error message and, as specified by the remit given to
>> programmers, a critical guidance system shut down in mid-flight. Ka-boom.
>>

>I work for an aerospace company, and we recieved a fairly detailed accounting of what
>went wrong with the Ariane 5. Launch vehicles, while they are sitting on the launch
>pad, run a guidance program that updates their position and velocity in reference to
>a coordinate frame whose origin is at the center of the earth (usually called an
>Earth-Centered-Inertial (ECI) frame). This program is usually started up from 1 to 3-4
>hours before launch and is allowed to run all the way until liftoff, so that the
>rocket will know where it's at and how fast it's going at liftoff. Although called
>"ground software," (because it runs while the rocket is on the ground), it resides
>inside the rocket's guidance computer(s), and for the Titan family of launch vehicles,
>the code is exited at t=0 (liftoff). This code is designed with knowing that the
>rocket is rotating on the surface of the earth, and the algorithms expect only very
>mild accelerations (as compared to when the rocket hauls ass off the pad at liftoff).

>Well, the French do things a little differently (but probably now they don't). The
>Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs
>past liftoff. They do (did) this in case there are any unanticipated holds in the
>countdown right close to liftoff. In this way, this position and velocity updating
>code would *not* have to be reset if they could get off the ground within just a few

>seconds of nominal. Well, it appears that the Ariane 5 really hauls ass off the pad,
>because at about 30 secs, it was pulling some accelerations that caused floating point
>overflows

---Definitely not. No floating-point overflow occurred. In
Ariane 5, the overflow occurred on converting a double-precision
(some 56 bits?) floating-point to a 16-bit integer (15
significant bits).

That's why it was so important to have a check that the
conversion couldn't overflow!


>in the still functioning ground software. The actual flight software (which
>was also running, naturally) was computing the positions and velocities that were
>being used to actually fly the rocket, and it was doing just fine - no overflow errors
>there because it was designed to expect high accelerations. There are two flight
>computers on the Ariane 5 - a primary and a backup - and each was designed to shut
>down if an error such as a floating point overflow occurred,

---Again, not at all. It was designed to shut down if any interrupt
occurred. It wasn't intended to be shut down for such a routine thing as
a conversion of floating-point to integer.

>thinking that the other
>one would take over. Both computers were running the ground software, and both
>experienced the floating point errors.


---No, the backup SRI experienced the programming error (UNCHECKED
CONVERSION from floating-point to integer) first, and shut itself
down, then the active SRI computer experienced the same programming
error, then it shut itself down.

Steve O'Neill

unread,
Oct 9, 1996, 3:00:00 AM10/9/96
to

@@ robin wrote:
> ---Definitely not. No floating-point overflow occurred. In
> Ariane 5, the overflow occurred on converting a double-precision
> (some 56 bits?) floating-point to a 16-bit integer (15
> significant bits).
>
> That's why it was so important to have a check that the
> conversion couldn't overflow!
Agreed. Yes, the basic reason for the destruction of a billion dollar
vehicle was for want of a couple of lines of code. But it reflects a
systemic problem much more damaging than what language was used.

I would have expected that in a mission/safety critical application
the proper checks would have been implemented, no matter what. And in a
'belts-and-suspenders' mode I would also expect an exception handler to
take care of unforeseen possibilities at the lowest possible level and
raise things to a higher level only when absolutely necessary. Had these
precautions been taken there would probably be lots of entries in an
error log but the satellites would now be orbiting.

As outsiders we can only second guess as to why this approach was not
taken but the review board implies that 1) the SRI software developers
had an 80% max utilization requirement and 2) careful consideration
(including faulty assumptions) was used in deciding what to protect and
not protect.

>It was designed to shut down if any interrupt occurred. It wasn't
                                     ^^^^^^^^^
                                     exception, actually


>intended to be shut down for a routine thing as a conversion of
>floating-point to integer.

This was based on the (faulty) system-wide assumption that any exception
was the result of a random hardware failure. This is related to the
other faulty assumption that "software should be considered correct until it
is proven to be at fault". But that's what the specification said.

> ---No, the backup SRI experienced the programming error (UNCHECKED
> CONVERSION from floating-point to integer) first, and shut itself
> down, then the active SRI computer experienced the same programming
> error, then it shut itself down.

Yes, according to the report the backup died first (by 0.05 seconds).
Probably not as a result of an unchecked_conversion though - the source
and target are of different sizes which would not be allowed. Most
likely just a conversion of a float to a sixteen-bit integer. This
would have raised a Constraint_Error (or Operand_Error in this
environment). This error could have been handled within the context of
this procedure (and the mission continued) but obviously was not.
Instead it appears to have been propagated to a global exception handler
which performed the specified actions admirably. Unfortunately these
included committing suicide and, in doing so, dooming the mission.
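
A hedged sketch of the kind of local handling being described; the names, the
choice to saturate, and the use of the common 64-bit Long_Float are
illustrative only, not taken from the actual SRI code:

   procedure Convert_Bias_Demo is
      type Int_16 is range -2**15 .. 2**15 - 1;

      procedure Convert (BH : in Long_Float; R : out Int_16) is
      begin
         R := Int_16 (BH);         -- raises Constraint_Error if BH is out of range
      exception
         when Constraint_Error =>
            if BH >= 0.0 then      -- saturate and carry on locally, rather than
               R := Int_16'Last;   -- letting the exception take down the channel
            else
               R := Int_16'First;
            end if;
      end Convert;

      Result : Int_16;
   begin
      Convert (1.0E9, Result);     -- deliberately out of range; Result saturates
   end Convert_Bias_Demo;

Whether saturating (rather than, say, flagging the value as suspect) is the
right recovery is of course an application-level question.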

--
Steve O'Neill | "No,no,no, don't tug on that!
Sanders, A Lockheed Martin Company | You never know what it might
smon...@sanders.lockheed.com | be attached to."
(603) 885-8774 fax: (603) 885-4071| Buckaroo Banzai

Ken Garlington

unread,
Oct 10, 1996, 3:00:00 AM10/10/96
to

Robert S. White wrote:
>
> In article <dewar.844517570@schonberg>, de...@schonberg.cs.nyu.edu says...
>
> >There are certainly cases where careful consideration of these three factors
> >still results in a decision to use less hardware and more complex software,
> >but I think we have all seen cases where such decisions were made, and in
> >retrospect turned out to be huge mistakes.
>
> In business when making hard decisions about embedded systems products,
> such studies are almost always made.

In my experience, both statements are true. Such studies are often made, and I have
also seen cases where they weren't made, or were done poorly. I have no idea what the
percentages are for:

* the study was done correctly
* the study was not done correctly (or not done at all), but the decision turned
out to be right anyway
* the study was done incorrectly, the answer was wrong, and no one ever discovered
it was wrong (because no one ever looked at the final cost, etc.)
* the study was done incorrectly, the answer was wrong, someone found out it was
wrong, but didn't broadcast it to the general public (would you?)

Overall, I'd say we'll never know. As a colleague of mine said: "How many systems out
there have a bug like the Ariane 5, but just never hit that magic condition where the
bug caused a failure?" Just think: A little less acceleration on takeoff, and we'd
think Arianespace made a wonderful decision by reusing the Ariane 4 software -- look at all the
money they saved! It might have been mentioned in the Reuse News as a major success :)

I've got a few minutes, so I'll mention another of my favorite themes at this point
(usually stated in the context of preparing Ada waivers): It's really hard to
determine the life-cycle cost of software, particularly over a long period (e.g. 20
years). There are cost models; sometimes, we even get the parameters right and the
model comes up with the right answer. Nonetheless, it's tough to consider life-cycle
costs objectively. That's not an excuse for failing to try, but an acknowledgement
that it's easy to get it wrong (particularly for new technology).

Software engineering can be _so_ depressing!

Ken Garlington

unread,
Oct 10, 1996, 3:00:00 AM10/10/96
to

Robert Dewar wrote:
>
> I find the second outcome here unacceptable, so the methodology must simply
> prevent such errors completely. Indeed if you look at safety critical
> subsets for Ada they often omit exceptions precisely because of this
> consideration. After all exceptions make the language and compiler more
> complex, and that itself may introduce concerns at the safety critical
> level.

I'm also starting to be convinced, after some anecdotal evidence with the systems
I work on, that _suppressing_ checks can also make the compiler more fragile. My guess is
that fewer people in general suppress all checks for most compilers, so those
paths in the compiler that run with checks suppressed are used less often,
and so they have a higher probability of containing bugs. I also suspect that most
vendors do not run their standard test suites (including ACVCs) with checks
suppressed (how could you, for the part of the test suite that validates exception
raising and handling?), so there's less coverage from that source as well.

I'm not saying that it's dumb to suppress checks (or not suppress checks) for
safety-critical systems. I'm just saying the answer appears to be a lot more
complicated than I thought it was 10 years ago (or even 2 years ago).

Alan Brain

unread,
Oct 12, 1996, 3:00:00 AM10/12/96
to

Steve O'Neill wrote:

> I would have expected that in a mission/safety critical application
> the proper checks would have been implemented, no matter what. And in a
> 'belts-and-suspenders' mode I would also expect an exception handler to
> take care of unforeseen possibilities at the lowest possible level and
> raise things to a higher level only when absolutely necessary. Had these
> precautions been taken there would probably be lots of entries in an
> error log but the satellites would now be orbiting.

Concur completely. This should be Standard Operating Procedure, a matter
of habit. Frankly, it's just good engineering practice. But it is honoured
more in the breach than in the observance, it seems, because....



> As outsiders we can only second guess as to why this approach was not
> taken but the review board implies that 1) the SRI software developers
> had an 80% max utilization requirement and 2) careful consideration
> (including faulty assumptions) was used in deciding what to protect and
> not protect.

... as some very reputable people, working for very reputable firms, have
tried to pound into my thick skull, they are used to working with 15%, no
more, tolerances. And with diamond-grade Hard Real Time slices, where any
over-run, no matter how slight, means disaster. In this case, Formal Proof
and strict attention to the number of CPU cycles in all possible paths seems
the only way to go.
But this leaves you so open to error in all but the simplest, most trivial
tasks (just the race analysis would be nightmarish) that these slices had
better be a very small part of the task, or the task itself must be very
simple indeed. Either way, not having much bearing on the vast majority of
problems I've encountered.
If the tasks are not simple....then can I please ask the firms concerned to
tell me which aircraft their software is on, so I can take appropriate
action?

Matthew Heaney

unread,
Oct 14, 1996, 3:00:00 AM10/14/96
to

In article <dewar.844518011@schonberg>, de...@schonberg.cs.nyu.edu (Robert
Dewar) wrote:

>As Ken says, checks are not a magic wand. They are a powerful tool, but

>like any tool, subject to abuse. A chain saw with a kickback guard on the
>end is definitely a safer tool to use, especially for an amateur, than
>one without (something I appreciate while clearing paths through the woods
>at my Vermont house), but it does not mean that now the tool is a completely
>safe one, and indeed a real expert with a chain saw will often feel that it
>is safer to operate without the guard, because then the behavior of the
>chainsaw is simpler and more predictable.

I think we're all in basic agreement.

As you stated, exceptions are only a tool. They don't replace the need for
(mental) reasoning about the correctness of my program, nor should they be
used to guard against sloppy programming. Exceptions don't correct the
problem for you, but at least they let you know that a problem exists.

And in spite of all the efforts of the Ariane 5 developers, a problem did
exist, significant enough to cause mission failure. Don't you think an
exception was justified in this case?

Yes, I agree that there may be times when you don't need any sophisticated
exception handling, and you could safely turn checks off. But surely there
are important sections of code, say for a critical algorithm, that justify
the use of checks.

Believe me, I would love to write a software system that I knew were
(formally) correct and didn't require run-time checks. But I am not able
to build that system today. So what should I do?

Though I may be the most practiced walker of tightropes, I still like
having that safety net underneath me.

-matt

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mhe...@ni.net
(818) 985-1271

Marin David Condic, 407.796.8997, M/S 731-93

unread,
Oct 14, 1996, 3:00:00 AM10/14/96
to

Alan Brain <aeb...@DYNAMITE.COM.AU> writes:
>more, tolerances. And with diamond-grade Hard Real Time slices, where any
>over-run, no matter how slight, means disaster. In this case, Formal Proof
>and strict attention to the number of CPU cycles in all possible paths seems
>the only way to go.
>But this leaves you so open to error in all but the simplest, most trivial
>tasks (just the race analysis would be nightmarish) that these slices had
>better be a very small part of the task, or the task itself must be very
>simple indeed. Either way, not having much bearing on the vast majority
>
In my experience with this sort of "Hard Real Time" code, you are
typically talking about relatively straightforward code - albeit
difficult to develop. (Ask A. Einstein how long it took him to
write the "E := M * C**2 ;" function.)

The parts which typically have hard deadlines tend to be heavy on
math or data motion and rather light on branching and call chain
complexity. You want your "worst case" timing to be your nominal
path and you'd like for it to be easily analyzed and very
predictable. Usually, it's a relatively small part of the system
and maybe (MAYBE!) you can turn off runtime checks for just this
portion of the code, leaving it in for the things which run at a
lower duty cycle.

Of course the truly important thing to remember is that compiler
generated runtime checks are not a panacea. They *may* have helped
with the Ariane 5, if there was an appropriate accommodation once
the error was detected. (Think about it. If the accommodation was
"Shut down the channel and pass control to the other side" {Very
common in a dual-redundant system} it would have made no
difference.) But most of the errors I've encountered in realtime
systems have been of the "logic" variety. ("Gee! We thought 'x'
was the proper course of action when this condition comes up and
really it should have been 'y'" or "I didn't know the control
would go unstable if parameter 'x' would slew across its range
that quickly!?!?!") Runtime checks aren't ever going to save us
from that sort of mistake - and those are the ones which show up
most often. (Unless, of course, you program in C ;-)

An aside which has something to do with Ada language constructs:
In most of our work (control systems) it would be far more useful
for math over/underflows to saturate and continue on, rather than
raise an exception and halt processing. Ada never defined any
numeric types with this sort of behavior - and I find it difficult
to believe that many others in similar embedded applications
wouldn't also desire this behavior from some predefined floating,
fixed, and integer types. Of course, the language allows us to
define our own types and (if there's proper hardware and compiler
support for dealing with it) efficient "home-brew" solutions can
be built. Still, it would have seemed appropriate for the language
designers to have built some direct support for a very common
embedded need.

MDC

Marin David Condic, Senior Computer Engineer ATT: 561.796.8997
M/S 731-96 Technet: 796.8997
Pratt & Whitney, GESP Fax: 561.796.4669
P.O. Box 109600 Internet: COND...@PWFL.COM
West Palm Beach, FL 33410-9600 Internet: CON...@FLINET.COM
===============================================================================
"The speed with which people can change a courtesy into an
entitlement is awe-inspiring."

-- Miss Manners, February 8, 1994
===============================================================================

Robert Dewar

unread,
Oct 15, 1996, 3:00:00 AM10/15/96
to

Matthew says

"Believe me, I would love to write a software system that I knew were
(formally) correct and didn't require run-time checks. But I am not able
to build that system today. So what should I do?"


First of all, I would object to the "formally" and even the word "correct"
here. These are technical terms which relate to, but are not identical with,
the important concept which is reliability.

It *is* possible to write reliable programs, though it is expensive. If you
need to do this, and are not able to do it, then the answer is to investigate
the tools that make this possible, and understand the necessary investment
(which is alarmingly high). Some of these tools are related to correctness,
but that's not the main focus. There are reliable incorrect programs and
correct unreliable programs, and what we are interested in is reliability.

For an example of toolsets that help achieve this aim, take a look at the
Praxis tools. There are many other examples of methodologies and tools that
can be used to achieve high reliability.

Now of course informally we would like to make all programs reliable, but
there is a cost/benefit trade off. For most non-safety critical programming
(but not all), it is simply not cost effective to demand total reliability.


Robert I. Eachus

unread,
Oct 15, 1996, 3:00:00 AM10/15/96
to

In article <9610141...@psavax.pwfl.com> "Marin David Condic, 407.796.8997, M/S 731-93" <cond...@PWFL.COM> writes:

> In most of our work (control systems) it would be far more useful
> for math over/underflows to saturate and continue on, rather than
> raise an exception and halt processing. Ada never defined any
> numeric types with this sort of behavior - and I find it difficult
> to believe that many others in similar embedded applications
> wouldn't also desire this behavior from some predefined floating,
> fixed, and integer types. Of course, the language allows us to
> define our own types and (if there's proper hardware and compiler
> support for dealing with it) efficient "home-brew" solutions can
> be built. Still, it would have seemed appropriate for the language
> designers to have built some direct support for a very common
> embedded need.

They did. First look at 'Machine_Overflows. It is perfectly
legal for even Float'Machine_Overflows to be false and the
implementation to return, say IEEE nonsignaling NaNs in such a case.
Also RM95 3.5.5(26) and 3.5.6(8) allow for nonstandard integer and
real types respectively, and mention saturation types as one possible
use for the feature.

Talk to your vendor or check out what GNAT actually does on your
hardware.

--

Robert I. Eachus

with Standard_Disclaimer;
use Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...

Robert Dewar

unread,
Oct 15, 1996, 3:00:00 AM10/15/96
to

Marin said

" > In most of our work (control systems) it would be far more useful
> for math over/underflows to saturate and continue on, rather than
> raise an exception and halt processing. Ada never defined any
> numeric types with this sort of behavior - and I find it difficult
> to believe that many others in similar embedded applications
> wouldn't also desire this behavior from some predefined floating,
> fixed, and integer types. Of course, the language allows us to
> define our own types and (if there's proper hardware and compiler
> support for dealing with it) efficient "home-brew" solutions can
> be built. Still, it would have seemed appropriate for the language
> designers to have built some direct support for a very common
> embedded need."


Well there is always a certain kind of viewpoint that wants more, more, more
when it comes to features in a language, but I think that saturating types
would be overkill in terms of predefined integral types. Adding new classes
of integral types adds a lot of stuff to the language, just look at all the
stuff for supporting modular types.

I think a much more reasonable approach for saturating operators is to
define the necessary operators. If you need some very clever efficient
code for these operators, then either use inlined machine code, or persuade
your vendor to implement these as efficient intrinsics, that is always
allowed.
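
A hedged sketch of that operator approach, assuming Integer is wider than 16
bits (the widening trick is one simple way to do it, not necessarily the
fastest; the names are made up for the example):

   package Saturating_16 is
      type Sat_16 is range -2**15 .. 2**15 - 1;
      function "+" (L, R : Sat_16) return Sat_16;
   end Saturating_16;

   package body Saturating_16 is
      function "+" (L, R : Sat_16) return Sat_16 is
         Sum : constant Integer := Integer (L) + Integer (R);  -- cannot overflow here
      begin
         if Sum > Integer (Sat_16'Last) then
            return Sat_16'Last;
         elsif Sum < Integer (Sat_16'First) then
            return Sat_16'First;
         else
            return Sat_16 (Sum);
         end if;
      end "+";
   end Saturating_16;

A vendor could supply the same operation as an efficient intrinsic, as
suggested above, if the open-coded version is not fast enough.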


Michael F Brenner

unread,
Oct 16, 1996, 3:00:00 AM10/16/96
to

R. Dewar said:

> I think that saturating types
> would be overkill in terms of predefined integral types. Adding new classes
> of integral types adds a lot of stuff to the language, just look at all the
> stuff for supporting modular types.
>
> I think a much more reasonable approach for saturating operators is to
> define the necessary operators. If you need some very clever efficient
> code for these operators, then either use inlined machine code, or persuade
> your vendor to implement these as efficient intrinsics, that is always
> allowed.

This hops on both sides of the horse at the same time. It was good to add
modular types into Ada-95, but it was bad to add a lot of stuff to the language.
It was an unnecessary management decision, not related to the technical
requirement for efficient modular types, to add anything to the language
Other Than clever, efficient, reliable code for modular operators. A different
management decision would have been to keep the way all Ada 83 compilers
with modular types did it, leaving them Represented as ordinary integers,
but overloading an alternate set of arithmetic operators over those ordinary
integers, so that conversion between two's complement and modular binary
would not require any code to be generated (except a possibly optimized
away copy of the integer). This would still permit inefficient BCD
implementations of modular arithmetic wherever the efficient hardware
operators are not available on a given target architecture.

Had this alternate decision been made, then R. Dewar's second half of the
comment could have been focussed on with more energy, namely, how can
more efficient code be generated for several different kinds of operators.
Solutions sometimes available are interfacing to assembler language and
inline machine code. Solutions available to those with larger than normal
amounts of funding include paying a compiler maintainer to implement
an efficient intrinsic function. But Another Way, for future consideration,
is to permit users to implement attributes or efficient intrinsic functions
by permitting pragmas which Demand certain performance requirements
of the generated code. As Dr. Dewar has repeatedly pointed out, performance
requirements are currently beyond the scope of the language definition.
However, many programs have performance requirements, and having
a way to specify them (in an Appendix) would not detract from the
language, but make it more useful in the realtime world. Examples of
such specifications include: (1) the topic of this thread (saturating overflows),
(2) do not generate code for a given instantiation of unchecked_conversion,
(3) do not even generate a copy for invocations of a given instantiation
of unchecked_conversion, (4) permit modular operations on an ordinary
user-defined range type, (5) use a particular run-time routine to implement
a particular array slice or Others initialization, (6) use a particular
machine code instruction to implement a particular array slice or Others
initialization, (7) truly deallocate a variable now, (8) truly deallocate
all variables of a given subtype now, (9) permit the use of all bits in
the word in a given fixed-point arithmetic type, etc.

Ken Garlington

unread,
Oct 16, 1996, 3:00:00 AM10/16/96
to

Matthew Heaney wrote:
>
> As you stated, exceptions are only a tool. They don't replace the need for
> (mental) reasoning about the correctness of my program, nor should they be
> used to guard against sloppy programming. Exceptions don't correct the
> problem for you, but at least they let you know that a problem exists.
>
> And in spite of all the efforts of the Ariane 5 developers, a problem did
> exist, significant enough to cause mission failure. Don't you think an
> exception was justified in this case?

Not necessarily. Keep in mind that an exception _was_ raised -- a predefined
exception (Operand_Error according to the report). There was sufficient telemetry
to determine where the error occurred (obviously, otherwise we wouldn't know what
happened!). If the real Ariane 5 trajectory had been tested in an integrated
laboratory environment, then (assuming the environment was realistic enough to
trigger the problem), the fault would have been seen (and presumably analyzed and
fixed) prior to launch. So, the issue is not the addition of a user-defined
exception to find the error -- the issue is the addition of a new exception
_handler_ to _recover_ from the error in flight.

Assuming that a new exception _handler_ had been added, then it _might_ have made
a difference. If it did nothing more than the system exception handler (shutting
down the channel), then the only potential advantage of the exception _handler_
might have been to allow fault isolation to happen faster (e.g. if the exception
were logged in some manner). This assumes that either the exception message was
sent out with the telemetry, or else the on-board fault logging survived the
crash. On the other hand, if it had shut down just the alignment function, then
it might have saved the system. Without more knowledge about the IRS
architecture, there's no way to say.

> Yes, I agree that there may be times when you don't need any sophisticated
> exception handling, and you could safely turn checks off. But surely there
> are important sections of code, say for a critical algorithm, that justify
> the use of checks.
>

> Believe me, I would love to write a software system that I knew were
> (formally) correct and didn't require run-time checks. But I am not able
> to build that system today. So what should I do?
>

> Though I may be the most practiced walker of tightropes, I still like
> having that safety net underneath me.

Just make sure that your safety net isn't lying directly on the ground. Without
the use of a frame (exception handlers that actually do the right thing to
recover the system), you'll find the landing is just as hard with or without the
net!

You might also want to make sure that the net isn't suspended so high that you're
walking _below_ it, or even worse that you hit your head on the net and it knocks
you off the rope (just to stretch this analogy a bit further). In other words, a
complex exception handling structure might actually _detract_ from the
reliability of your system. There is some merit to the Keep It Simple, Stupid
principle.

>
> -matt
>
> --------------------------------------------------------------------
> Matthew Heaney
> Software Development Consultant
> mhe...@ni.net
> (818) 985-1271

--

Robert Dewar

unread,
Oct 16, 1996, 3:00:00 AM10/16/96
to

Michael Brenner said

"(2) do not generate code for a given instantiation of unchecked_conversion,"

Most of the points are dubious, but I concentrate on this one, because it
is a common confusion. In general, almost any unchecked conversion you can
think of will require code on some architecture. The attempt to legislate
such code out of existence is pragmatically badly flawed, never mind being
completely impractical to specify formally (at the level of a language
definition there is no such thing as code!)


Marin David Condic, 407.796.8997, M/S 731-93

unread,
Oct 16, 1996, 3:00:00 AM10/16/96
to

Robert Dewar <de...@MERV.CS.NYU.EDU> writes:
>It *is* possible to write reliable programs, though it is expensive. If you
>need to do this, and are not able to do it, then the answer is to investigate
>the tools that make this possible, and understand the necessary investment
>(which is alarmingly high). Some of these tools are related to correctness,
>but that's not the main focus. There are reliable incorrect programs and
>correct unreliable programs, and what we are interested in is reliability.
>
<snip>

>Now of course informally we would like to make all programs reliable, but
>there is a cost/benefit trade off. For most non-safety critical programming
>(but not all), it is simply not cost effective to demand total reliability.
>
You are absolutely correct about the cost. The control software we
build is tested exhaustively from the module level on up to the
integration with physical sensors & actuators well before it gets
to drive an engine on a test stand - much less fly. It *is*
enormously expensive - but in the present day it's the only way to
be sure you aren't trying to fly something that will break.

The point is that our software testing was derived from the same
mindset as our hardware testing (turbine blades, pumps, bearings,
etc.) We probably test a hardware component for an engine even
more rigorously and at greater expense than we do for software -
which is, after all, just another "part" for the engine. The
mistake that is often made when looking at software is to think
that somehow (because it passed the "smoke" test?) it doesn't need
the same sort of rigorous testing we'd demand of any physical
device in order to be proven reliable.

Who would want to fly in an airplane powered by engines, the
design for which had been verified by powering up a single
prototype once and running it for 10 minutes? You'd probably feel
a lot safer if we ran a couple of prototypes right into the
ground, including making them ingest a few birds and deliberately
cutting loose a turbine blade or two at speed. If you want
reliable software, the testing can be no less rigorous.

Ralf Tilch

unread,
Oct 17, 1996, 3:00:00 AM10/17/96
to

--

Hello,

I followed the discussion of the ARIANE 5 failure.
I didn't read all the mails, and I am quite astonished
how far and in how many details it can be discussed.
Like,
which programming language would have been the best,
.....

It's good to know what happened.
I think what is more important:
you build something new (very complex).
You invest some billions to develop it.
You build it (an ARIANE 5, carrying several satellites).
The price of it is several hundred million,
and you don't check as much as possible,
make a 'very complete check',
especially of the software.

The reason that the software wasn't checked:
it was too 'expensive'?!?!

They forgot Murphy's law, which always 'works'.


I think you can't design a new car without
testing it completely.

We test 95% of the construction, and six months after
the new car goes on sale a wheel falls off at 160 km/h.
OK, there was a small problem in the construction software -
some wrong values, due to some over- or underflows or
whatever.

The result: the company will probably have to pay quite a
lot, and probably have to close!

--------------------------------------------------------
-DON'T TRUST YOURSELF, TRUST MURPHY'S LAW !!!!

"If anything can go wrong, it will."
--------------------------------------------------------
With this, have fun and continue the discussion about
conversion from 64bit to 16bit values,etc..

RT


________________|_______________________________________|_
| E-mail : R.T...@gmd.de |
| Tel. : (+49) (0)2241/14-23.69 |
________________|_______________________________________|_
| |

Ravi Sundaram

unread,
Oct 17, 1996, 3:00:00 AM10/17/96
to

Ralf Tilch wrote:
> The reason that the software wasn't checked:
> It was too 'expensive'?!?!.

Yeah, isn't hindsight a wonderful thing?
They, whoever was in charge of these decisions,
knew too that testing is important. But it is impossible
to test every subcomponent under every possible
condition. There is simply not enough money or time
available to do that.

Take the space shuttle, for example. The total computing
power available on board is probably about as much as is used
in a Nintendo Game Boy. The design was frozen in the 1970s.
Upgrading the computers and software would be so expensive
to test and prove that they approach it with much trepidation.

Richard Feynman was examining the practices of NASA and
found that the workers who assembled some large bulkheads
had to count bolts from two reference points. He thought
providing four reference points would simplify the job.
NASA rejected the proposal because it would involve
too many changes to the documentation, procedures and
testing. (Surely You're Joking, Mr. Feynman - volume I or II?)

So praise them for conducting a no nonsense investigation
and owning up to the mistakes. Learn to live with
failed space shots. They will become as reliable as
air travel once we have launched about 10 million rockets.

--
Ravi Sundaram.
10/17/96
PS: I am out of here. Going on vacation. Won't read followups
for a month.
(Opinions are mine, not Ansoft's.)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Keith Thompson

unread,
Oct 18, 1996, 3:00:00 AM10/18/96
to

In <326506...@lmtas.lmco.com> Ken Garlington <garlin...@lmtas.lmco.com> writes:
[...]

> Not necessarily. Keep in mind that an exception _was_ raised -- a
> predefined exception (Operand_Error according to the report).

This is one thing that's confused me about this report. There is no
predefined exception in Ada called Operand_Error. Either the overflow
raised Constraint_Error (or Numeric_Error if they were using an Ada
83 compiler that doesn't follow AI-00387), or a user-defined exception
called Operand_Error was raised explicitly.

Ken Garlington

unread,
Oct 18, 1996, 3:00:00 AM10/18/96
to

Keith Thompson wrote:
>
> In <326506...@lmtas.lmco.com> Ken Garlington <garlin...@lmtas.lmco.com> writes:
> [...]
> > Not necessarily. Keep in mind that an exception _was_ raised -- a
> > predefined exception (Operand_Error according to the report).
>
> This is one thing that's confused me about this report. There is no
> predefined exception in Ada called Operand_Error. Either the overflow
> raised Constraint_Error (or Numeric_Error if they were using an Ada
> 83 compiler that doesn't follow AI-00387), or a user-defined exception
> called Operand_Error was raised explicitly.

It confused me too. I'm guessing that language differences are part of the
answer here, but I have no idea. It's also possible that the CPU hardware
specification has something called an "Operand Error" interrupt which is
generated during an overflow, which I assume gets mapped into Constraint_Error
(as is common with the MIL-STD-1750 CPU, for instance).

I also would be interested in any information about "Operand_Error".

>
> --
> Keith Thompson (The_Other_Keith) k...@thomsoft.com <*>
> TeleSoft^H^H^H^H^H^H^H^H Alsys^H^H^H^H^H Thomson Software Products
> 10251 Vista Sorrento Parkway, Suite 300, San Diego, CA, USA, 92121-2706
> FIJAGDWOL

--

Samuel T. Harris

unread,
Oct 18, 1996, 3:00:00 AM10/18/96
to

Keith Thompson wrote:
>
> In <326506...@lmtas.lmco.com> Ken Garlington <garlin...@lmtas.lmco.com> writes:
> [...]
> > Not necessarily. Keep in mind that an exception _was_ raised -- a
> > predefined exception (Operand_Error according to the report).
>
> This is one thing that's confused me about this report. There is no
> predefined exception in Ada called Operand_Error. Either the overflow
> raised Constraint_Error (or Numeric_Error if they were using an Ada
> 83 compiler that doesn't follow AI-00387), or a user-defined exception
> called Operand_Error was raised explicitly.
>

Remember, the report does NOT state that an unchecked_conversion
was used (as some on this thread have assumed). It only states
a "data conversion from 64-bit floating point to 16-bit signed
integer value". As someone (I forget who) pointed out early
in the thread weeks ago, a standard practice is to scale down
the range of a float value to fit into an integer variable.
This may not have been an unchecked_conversion at all, but
some mathematical expression.
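
For instance, such a scaled conversion might look something like this
(a sketch only; the names and scale factor are my own invention, not
anything quoted from the report):

   procedure Scale_Demo (Horizontal_Bias : in Long_Float) is
      --  Hypothetical names; the point is the conversion, not the math.
      subtype Int16 is Integer range -2**15 .. 2**15 - 1;
      Scale : constant Long_Float := 128.0;
      BH    : Int16;
   begin
      --  The type conversion inside the scaling expression is
      --  range-checked: if the scaled value does not fit in 16 bits,
      --  an exception is raised.  An Unchecked_Conversion, by contrast,
      --  would merely reinterpret bits and check nothing.
      BH := Int16 (Horizontal_Bias / Scale);
   end Scale_Demo;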

Whenever software is reused, it must be reverified AND
revalidated. The report cites several reasons for not
reverifying the reuse of the SRI from the Ariane 4, any
one of which may be justifiable on its own. However, a cardinal rule
of risk management is that any risk to which NO measures
are applied remains a risk. Here they justified their way
into applying no measures at all toward ensuring the stuff
would work.

The report also states that the code which contained the
conversion was part of a feature which was now obsolete
for the Ariane 5. It was left in "presumably based on the view that,
unless proven necessary, it was not wise to make changes in software
which worked well on Ariane 4." While this does make good sense,
it is by no means a verification or a validation.
It just seems to mitigate your risk, but it really does
no such thing. You can't let such thinking lull you into
a false sense of security.

The analysis which led to protecting four variables from
Operand_Error and leaving three unprotected was not revisited
with the new environment in mind. How could it have been, since
the Ariane 5 trajectory data was not included as a functional
requirement? Hence this measure does not apply to the risk
of the Ariane 5, though some involved in the decision may have relied
upon it for just that protection.
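
(Presumably the "protection" in question amounts to an explicit guard
around the conversion, something along these lines -- the names and the
saturation policy are my guesses, not the actual code:)

   procedure Guarded_Convert (Value : in Long_Float; Result : out Integer) is
      subtype Int16 is Integer range -2**15 .. 2**15 - 1;
   begin
      if abs Value <= Long_Float (Int16'Last) then
         Result := Int16 (Value);   -- the guard guarantees the range
                                    -- check cannot fail here
      else
         Result := Int16'Last;      -- saturate, or flag the value, rather
                                    -- than let the exception escape
      end if;
   end Guarded_Convert;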

Then they went as far as not revalidating the SRI in an
Ariane 5 environment, which was the real hurt. While the
report states the Ariane 5 flight data was not included as
a functional requirement, someone should have asked for it
if they needed it. Its omission means any verification testing
which was done would not have taken it into account,
so the software may well have been verified (verification being
testing against what the user said he wanted). However, validation
testers (who test what the user actually wants and are supposed to be
smart enough NOT to take the specification at face value)
should have insisted on such data, included or not.
That's the silly part about the whole affair: validation
testing also was not performed.

The report then goes on to discuss why the SRIs were not
included in a closed-loop test. So even if the Ariane 5
trajectory data had been included as a functional requirement,
it would not have helped. While the technical reasons
cited are appropriate for a verification test, the report
correctly points out that the goals of validation testing
are not so stringently dependent on the fidelity of the test
environment. Those reasons therefore do not justify leaving
the SRIs out of at least one validation test using Ariane 5
trajectory data, especially when other measures had NOT
been taken to ensure a compatible reuse of software.

In fact, section 2.1 states "The SRI internal events that
led to the failure have been reproduced by simulation calculations."
I wonder if they compiled and ran the Ada code on another
platform (which is a viable way of doing a lot of testing
for embedded software prior to embedding it).
The report does not state whether such testing was performed
by the developer. If the developer had done such testing with
the Ariane 5 trajectory data, the flaw would have been spotted --
but since that data was not a requirement, someone would have had
to ask explicitly for it.

The end of section 2.3 summarizes the fact that the reviews
did not pick up on the fact that, of all the potential measures which
could have been applied to determine a compatible reuse of
software in the Ariane 5 operational environment, NONE of
them were actually performed. That left the reviewers
blissfully ignorant of an unmitigated risk staring them in
the face.

Of the SRI, I conclude ...

No design error (though it could have done something better).
No programming error (given the design).
An arguable specification error (but without appropriate testing).
A lapse in validation testing (given that the other measures were non-existent).
A grave risk management and oversight problem.

Bottom line, a management (both customer and contractor) problem.

The OBC and main computer are another matter entirely.

I've not seen anyone on this thread address the entries
3.1.f and g, concerning the SRI sending diagnostic data (item f)
which was interpreted as flight data by the launcher's main
computer (item g). Section 2.1 states the backup failed first and
declared a failure, and the OBC could not switch to it because
it had already ceased to function. It seems the OBC knew about
the failures, so why did the main computer still interpret
any data from a failed component as flight data?

That seems like a design or programming problem. It is
blind luck that the diagnostic data caused the main computer
to try to correct the trajectory via extreme positions of the
thruster nozzles, which caused the rocket to turn sideways
to the air flow, which caused buckling in the superstructure,
which caused the self-destruct to engage.
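
To illustrate the kind of interface weakness that allows this (the
message layout below is entirely hypothetical -- I have no idea what
the actual bus protocol looks like):

   package OBC_Input is
      type SRI_Status is (Healthy, Failed);
      type SRI_Message is record
         Status : SRI_Status;
         Word   : Integer;   -- attitude data if Healthy, a diagnostic
                             -- bit pattern if Failed
      end record;
      procedure Consume (M : in SRI_Message; Attitude : in out Integer);
   end OBC_Input;

   package body OBC_Input is
      procedure Consume (M : in SRI_Message; Attitude : in out Integer) is
      begin
         --  The failure mode under discussion: using M.Word without
         --  first checking M.Status treats a diagnostic pattern as
         --  flight data.  The check below is what appears to have
         --  been missing.
         if M.Status = Healthy then
            Attitude := M.Word;
         --  else: keep the last good value, switch sources, or abort
         end if;
      end Consume;
   end OBC_Input;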

Given the design philosophy of the designers, had the main
computer known both SRIs had failed, it should have signaled a
self-destruct right then and there. What would have happened
if the "diagnostic" data had caused minor course corrections and
brought the rocket over a populated area before the subsequent
course of events (or the ground flight controllers themselves)
triggered a self-destruct?

The report does not delve into this aspect of the problem
which I consider to be even more important. This tends to
tell me the SRI simulators in the closed-loop testing which
was performed were not used to check malfunctions, or if
they were, then the test scenarios are incomplete or flawed.

How many other interface/protocol/integration problems
are waiting to crop up? Which reused Ariane 4 software component
will fail next? Stay tuned for these and other provocative
questions on "As the Ariane Burns" ;)

I wonder how the payload insurance companies will respond with
their pricing for the next couple of launches.

--
Samuel T. Harris, Senior Engineer
Hughes Training, Inc. - Houston Operations
2224 Bay Area Blvd. Houston, TX 77058-2099
"If you can make it, We can fake it!"

Ken Garlington

unread,
Oct 18, 1996, 3:00:00 AM10/18/96
to

Marin David Condic, 407.796.8997, M/S 731-93 wrote:

> Who would want to fly in an airplane powered by engines, the
> design for which had been verified by powering up a single
> prototype once and running it for 10 minutes. You'd probably feel
> a lot safer if we ran a couple of prototypes right into the
> ground, including making them ingest a few birds and deliberately
> cutting loose a turbine blade or two at speed. If you want
> reliable software, the testing can be no less rigorous.

Well, I know that on the YF-22 program, one of the engine manufacturers
did in fact cut loose a few turbine blades during system test -- although,
in that case, it was unintentional. We also ran one of the aircraft into the
ground -- again, unintentionally.

As for the birds, there is an interesting test done here in Fort Worth. (At
least, we used to do it -- I haven't actually witnessed one of these tests lately.)
To determine if the canopy will survive a bird strike, they actually take a bird
(presumably of mil-spec size and weight), load it into a cannon-type device, and
fire the bird at the canopy. By the way, it's not a good idea to use a _frozen_
bird for this test...

Frank Manning

unread,
Oct 19, 1996, 3:00:00 AM10/19/96
to

In article <326782...@lmtas.lmco.com> Ken Garlington
<garlin...@lmtas.lmco.com> writes:

> As for the birds, there is an interesting test done here in Fort Worth.
> (At least, we used to do it -- I haven't actually witnessed one of these
> tests lately.) To determine if the canopy will survive a bird strike,
> they actually take a bird (presumably of mil-spec size and weight), load
> it into a cannon-type device, and fire the bird at the canopy. By the
> way, it's not a good idea to use a _frozen_ bird for this test...

When I was in the Air Force, I heard a rumor there was an Air
Force facility that used chickens for similar testing. At one
time the guy in charge was a certain Colonel Sanders...

-- Frank Manning

Norman H. Cohen

unread,
Oct 21, 1996, 3:00:00 AM10/21/96
to

Similar testing was done in the Chinese air force. The program was so
successful that its director, Colonel Tso, was promoted to the rank of
general.

:-)

--
Norman H. Cohen
mailto:nco...@watson.ibm.com
http://www.research.ibm.com/people/n/ncohen

Ken Garlington

unread,
Oct 21, 1996, 3:00:00 AM10/21/96
to

Samuel T. Harris wrote:
>
> Keith Thompson wrote:
> >
> > In <326506...@lmtas.lmco.com> Ken Garlington <garlin...@lmtas.lmco.com> writes:
> > [...]
> > > Not necessarily. Keep in mind that an exception _was_ raised -- a
> > > predefined exception (Operand_Error according to the report).
> >
> > This is one thing that's confused me about this report. There is no
> > predefined exception in Ada called Operand_Error. Either the overflow
> > raised Constraint_Error (or Numeric_Error if they were using an Ada
> > 83 compiler that doesn't follow AI-00387), or a user-defined exception
> > called Operand_Error was raised explicitly.
> >
>
> Remember, the report does NOT state that an unchecked_conversion
> was used (as some on this thread have assumed). It only states
> a "data conversion from 64-bit floating point to 16-bit signed
> integer value". As someone (I forget who) pointed out early
> in the thread weeks ago, a standard practice is to scale down
> the range of a float value to fit into an integer variable.
> This may not have been an unchecked_conversion at all, but
> some mathematical expression.

In fact, I would be very surprised if unchecked_conversion was used.
It wouldn't make much sense to convert from float to fixed using UC.
More than likely, the constraint error/hardware interrupt was raised
due to an overflow of the 16-bit value during the type conversion part
of the scaling equation.

Marin David Condic, 407.796.8997, M/S 731-93

unread,
Oct 21, 1996, 3:00:00 AM10/21/96
to

Frank Manning <fr...@BIGDOG.ENGR.ARIZONA.EDU> writes:
>In article <326782...@lmtas.lmco.com> Ken Garlington
><garlin...@lmtas.lmco.com>
>
>> As for the birds, there is an interesting test done here in Fort Worth.
>> (At least, we used to do it -- I haven't actually witnessed one of these
>> tests lately.) To determine if the canopy will survive a bird strike,
>> they actually take a bird (presumably of mil-spec size and weight), load
>> it into a cannon-type device, and fire the bird at the canopy. By the
>> way, it's not a good idea to use a _frozen_ bird for this test...
>
>When I was in the Air Force, I heard a rumor there was an Air
>Force facility that used chickens for similar testing. At one
>time the guy in charge was a certain Colonel Sanders...
>
There is, in fact, a Mil Spec bird for bird-ingestion tests on jet
engines. (Similar procedure - fire 'em out of a cannon into the
turbine blades and film at high speed so you can watch it get
sliced into cold-cuts.) The specification may well apply to canopy
impact tests also since it would be seeing similar takeoff/landing
profiles.

I hear the Navy has its own standard for bird-ingestion. The
birds that follow aircraft carriers are apparently larger than
a Mark One/Mod Zero Air Force bird.

MDC
Marin David Condic, Senior Computer Engineer ATT: 561.796.8997
M/S 731-96 Technet: 796.8997
Pratt & Whitney, GESP Fax: 561.796.4669
P.O. Box 109600 Internet: COND...@PWFL.COM
West Palm Beach, FL 33410-9600 Internet: CON...@FLINET.COM
===============================================================================

"If you don't say anything, you won't be called on to repeat it."

-- Calvin Coolidge
===============================================================================

shm...@os2bbs.com

unread,
Oct 22, 1996, 3:00:00 AM10/22/96
to

In <326674...@ansoft.com>, Ravi Sundaram <ra...@ansoft.com> writes:
>Ralf Tilch wrote:
>> The reason that the software wasn't checked:
>> It was too 'expensive'?!?!.
>
> Yeah, isn't hindsight a wonderful thing?
> They, whoever were in charge of these decisions,
> too knew testing is important. But it is impossible
> to test every subcomponent under every possible
> condition. There is simply not enough money or time
> available to do that.

Why do you assume that it was hindsight? They violated fundamental
software engineering principles, and anyone who has been in this business
for long should have expected chickens coming home to roost, even if they
couldn't predict what would go wrong first.

> Richard Feynman was examining the practices of NASA and
> found that the workers who assembled some large bulkheads
> had to count bolts from two reference points. He thought
> providing four reference points would simplify the job.
> NASA rejected the proposal because it would involve
> too many changes to the documentation, procedures and
> testing. (Surely You're Joking, Mr. Feynman! -- or was it the sequel?)
>
> So praise them for conducting a no nonsense investigation
> and owning up to the mistakes. Learn to live with
> failed space shots. They will become as reliable as
> air travel once we have launched about 10 million rockets.

I hope that you're talking about Ariane and not NASA Challenger; Feynman's
account of the behavior of most of the Rogers Commission, in "What Do
You Care ..." sounds more like a failed coverup than like "owning up to
their mistakes", and Feynman had to threaten to air a dissenting opinion
on television before they agreed to publish it in their report.

Shmuel (Seymour J.) Metz
Atid/2


Adam Beneschan

unread,
Oct 22, 1996, 3:00:00 AM10/22/96
to

"Marin David Condic, 407.796.8997, M/S 731-93" <cond...@PWFL.COM> writes:

> There is, in fact, a Mil Spec bird for bird-ingestion tests on jet
> engines. (Similar procedure - fire 'em out of a cannon into the
> turbine blades and film at high speed so you can watch it get
> sliced into cold-cuts.) . . .

So that's where those MRE's come from . . .

:)

-- Adam

Jim Carr

unread,
Oct 22, 1996, 3:00:00 AM10/22/96
to

shmue...@os2bbs.com writes:
>
>I hope that you're talking about Ariane and not NASA Challenger; Feynman's
>account of the behavior of most of the Rogers Commission, in "What Do
>You Care ..." sounds more like a failed coverup than like "owning up to
>their mistakes", ...

The coverup was not entirely unsuccessful. Feynman did manage to break
through and get his dissenting remarks on NASA reliability estimates
into the report (as well as into Physics Today), but the coverup did
succeed in keeping most people ignorant of the fact that the astronauts
did not die until impact with the ocean despite a Miami Herald story
pointing that out to its mostly-regional audience.

Did you ever see a picture of the crew compartment?

--
James A. Carr <j...@scri.fsu.edu> | Raw data, like raw sewage, needs
http://www.scri.fsu.edu/~jac | some processing before it can be
Supercomputer Computations Res. Inst. | spread around. The opposite is
Florida State, Tallahassee FL 32306 | true of theories. -- JAC
