software engineering for scientific computing

Jonathan Thornburg [remove -animal to reply]

Jun 9, 2013, 4:27:51 PM

I recently ran across a nice paper on software-engineering issues for
scientific computing:

G. Wilson et al,
"Best Practices for Scientific Computing"
arXiv:1210.0530

There's nothing here that's particularly newsworthy for
software-engineering experts... but I know plenty of
scientists-who-spend-most-of-their-time-working-on-software who
aren't software-engineering experts. I think many s.p.r readers
could benefit from reading this short (6-page) paper.

Unlike a lot of the software-engineering literature that I've seen,
this paper does NOT make assumptions like
* we know ahead of time precisely what our software should do
* we know ahead of time precisely how our software should do it
* we know ahead of time what the "correct" output of our software is

ciao,

--
-- "Jonathan Thornburg [remove -animal to reply]" <jth...@astro.indiana-zebra.edu>
Dept of Astronomy & IUCSS, Indiana University, Bloomington, Indiana, USA
on sabbatical in Canada through September 2013
"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
-- quote by Freire / poster by Oxfam

Jonathan Thornburg [remove -animal to reply]

Jun 10, 2013, 3:05:22 AM

I wrote:
> I recently ran across a nice paper on software-engineering issues for
> scientific computing:

[[...]]

Another reference of note (really more a "horror story that we'd like
to avoid") is Hatton's "T-Experiments", in which he showed that different
software packages used in the oil/gas industry for analyzing seismic
reflection data gave very different results when applied to the same
input data:
http://kar.kent.ac.uk/21557/1/THE_T-EXPERIMENTS_ERRORS_IN.pdf
The differences seemed to be due to the accumulation of many small
(individually-reasonable) differences in which ranges of data to
select/exclude, treatment of boundary conditions, choice of integration
constants, etc. But the overall results are at best disquieting, and
at worst horrific.

In the scientific research areas that I'm most familiar with (numerical
simulations of black hole or neutron star orbits/collisions and the
resulting gravitational-radiation waveforms) the situation is much better:
researchers now routinely compare independently-written codes on test
problems, and the results generally agree quite well.

For example, Hannam et al (arXiv:0901.2437 = Phys Rev D 79, 084025)
compared 5 different numerical-relativity codes for simulating binary
black hole coalescence, and found excellent agreement. This agreement
is particularly impressive considering the diversity of computational
approaches: 4 different finite-difference codes using the BSSN
formulation of the Einstein equations with moving punctures, and a
pseudo-spectral code using a harmonic formulation of the Einstein
equations with excision.

Two other examples of note (both also finding good agreement):
* Baiotti, Shibata, and Yamamoto (arXiv:1007.1754)
compared general-relativistic-hydrodynamics codes for simulating
binary neutron-star mergers
* Sago, Barack, and Detweiler (arXiv:0810.2530 = Phys Rev D 78, 124024)
compared codes for computing the gravitational radiation-reaction
"self-force" acting on extreme mass-ratio orbiting binary black holes

I would like to hope that other research fields that rely heavily on
large-scale computation have similar code-validation projects.

Nicolaas Vroom

Jun 26, 2013, 3:27:32 AM

On Sunday, June 9, 2013 at 22:27:51 UTC+2, Jonathan
Thornburg [remove -animal to reply] wrote:

> I recently ran across a nice paper on software-engineering
> issues for scientific computing:
>
> G. Wilson et al,
> "Best Practices for Scientific Computing"

http://arxiv.org/abs/1210.0530

On page 2 they write:
" a workflow management tool can be used.
The paradigmatic example is compiling and linking programs
in languages such as Fortran, C++, Java, and C#.
The most widely used tool for this task is probably Make "
For scientific programming by engineers, IMO two important
constraints are readability and simplicity. Of the languages listed,
only Fortran falls in that category; Visual Basic and Pascal do too.

> Unlike a lot of the software-engineering literature that I've
> seen, this paper does NOT make assumptions like

> 1. we know ahead of time precisely what our software should do
> 2. we know ahead of time precisely how our software should do it
> 3. we know ahead of time what the "correct" output of our software is

IMO item (1) is very important: you need to know what your program is
supposed to do. It is very important to know the physics behind your program.

A typical(?) example is the program CAMB.
For an adapted listing see:
http://users.telenet.be/nicvroom/CAMB_all_html.htm

The issue is very important because if you want to understand
how to calculate the cosmological parameters based on the Cosmic
Microwave Background Radiation you have to use this program CAMB,
which indirectly means you have to understand the physics behind the
program. (And that is not easy.)

I agree with item (2) that you do not have to know ahead of time
what the final program will look like.

Item (3) is also very important, i.e. what the correct output should be.
You need test cases.
The issue is discussed correctly on page 3 of the document, in the
paragraph "Write and run tests".
For example, in the case of the CMB you need test cases to demonstrate
that your program for calculating the power spectrum is correct.
You need Sudoku test cases to test your Sudoku solver.
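
As a minimal sketch of the kind of test case meant here (this is not CAMB and not taken from the paper): a 1-D Fourier power spectrum stands in for the CMB angular power spectrum, and the hypothetical function power_spectrum is checked against an input whose answer is known analytically. A pure cosine of amplitude A at an integer frequency k0 should put power A^2/4 in bins k0 and n - k0 and nothing anywhere else. The function names, the normalization, and the tolerance are all assumptions made for illustration.

```python
import cmath
import math


def power_spectrum(samples):
    """Return |DFT|^2 / n^2 for each frequency bin (naive O(n^2) DFT)."""
    n = len(samples)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * j / n)
                    for j, x in enumerate(samples))
        spectrum.append(abs(coeff) ** 2 / n ** 2)
    return spectrum


def test_single_cosine(n=64, k0=5, amplitude=2.0):
    """A cosine of amplitude A at bin k0 has power A^2/4 in bins k0 and n-k0."""
    samples = [amplitude * math.cos(2 * math.pi * k0 * j / n)
               for j in range(n)]
    spectrum = power_spectrum(samples)
    expected = amplitude ** 2 / 4
    for k, p in enumerate(spectrum):
        # Every bin is checked, so leaked power fails the test too.
        target = expected if k in (k0, n - k0) else 0.0
        assert abs(p - target) < 1e-9, (k, p, target)
    return True
```

The point of the design is that the expected output comes from pencil-and-paper mathematics rather than from an earlier run of the same program, so the test can fail for real bugs.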

I do not agree with paragraph 8:
"Optimize software only after it works correctly"
Specifically the line:
" Since faster, lower level, languages require more lines of code to
accomplish the same task, scientists should write code in the
highest-level language possible, and shift to low-level languages like
C and Fortran only when they are sure the performance boost is needed "
IMO you should write your scientific program directly in
Fortran, Visual Basic or Pascal because these languages
require you to give enough detail.
Speed in most cases is not the issue.

For those interested in software, see:
http://users.telenet.be/nicvroom/performance.htm

Nicolaas Vroom.

Thomas Smid

Jul 15, 2013, 3:06:14 AM

On Monday, June 10, 2013 7:05:22 AM UTC, Jonathan Thornburg [remove -animal to reply] wrote:

> Another reference of note (really more a "horror story that we'd like
> to avoid") is Hatton's "T-Experiments", in which he showed that different
> software packages used in the oil/gas industry for analyzing seismic
> reflection data gave very different results when applied to the same
> input data:
>
> http://kar.kent.ac.uk/21557/1/THE_T-EXPERIMENTS_ERRORS_IN.pdf
>
> The differences seemed to be due to the accumulation of many small
> (individually-reasonable) differences in which ranges of data to
> select/exclude, treatment of boundary conditions, choice of integration
> constants, etc. But the overall results are at best disquieting, and
> at worst horrific.
>
> In the scientific research areas that I'm most familiar with (numerical
> simulations of black hole or neutron star orbits/collisions and the
> resulting gravitational-radiation waveforms) the situation is much better:
> researchers now routinely compare independently-written codes on test
> problems, and the results generally agree quite well.

I think your argument is quite misleading here in several respects:

1) The reference you quote refers to a study made 20 years ago.

2) Its author explicitly stresses that in industry, software is
usually developed independently by each company, whereas scientists
usually swap code (hence the results are more likely to be similar in
the latter case).

3) Certainly you should expect different results for different boundary
conditions or different data sets, so I don't see why such a dependence
should be a bad thing (on the contrary, one can learn a lot from it).

4) Having all software packages producing the same result does not mean
this result is correct. The underlying theory may be wrong after all.
Conformity in science can be a dangerous thing.

Thomas Smid

Jonathan Thornburg [remove -animal to reply]

Jul 26, 2013, 1:39:32 PM

Nicolaas Vroom <nicolaa...@pandora.be> wrote:
[[about

> G. Wilson et al,
> "Best Practices for Scientific Computing"
> http://arxiv.org/abs/1210.0530

]]

> I do not agree with paragraph 8:
> "Optimize software only after it works correctly"
> Specifically the line:
> " Since faster, lower level, languages require more lines of code to
> accomplish the same task, scientists should write code in the
> highest-level language possible, and shift to low-level languages like
> C and Fortran only when they are sure the performance boost is needed "
> IMO you should write your scientific program directly in
> Fortran, Visual Basic or Pascal because these languages
> require you to give enough detail.
> Speed in most cases is not the issue.

Some research projects benefit from using multiple languages. For
example, my current main research project involves a mixture of
(a) complicated symbolic-algebra computations, which generate series
expansions for...
(b) large numerical computations (typically taking a week to a month
of CPU time on a dozen or so dual-core processors), which generate
data files for ...
(c) some smaller numerical computations (typically taking less than
a minute on a laptop) to produce our final results
We (my colleague and I) use a mixture of Mathematica and Maple for (a),
a mixture of C++, C, and Fortran 77 for (b), and a mixture of Perl and
C for (c).

There are of course many possible ways to organize this set of
computations, but I think no one language is well-suited for all of
(a), (b), and (c).

For various historical reasons, essentially all researchers working
in this field use Mathematica for (a).

(b) is a "traditional number-crunching" code. Some of my colleagues
use C for this sort of code, some use Fortran 90/95/2003/2008, and I
use C++. Since these codes are already "painfully slow" (running times
of up to a month), I don't see rewriting in higher-level scripting
languages as reasonable for this part of the computation.

The tradeoffs between different languages for this sort of
number-crunching are interesting, but probably outside this
newsgroup's scope.

(c) is an interesting case: This program needs to read 20 or so data
files (written by the (b) computations), match up and sum corresponding
entries in the different data files, do a few hundred least-squares fits
to estimate some additional terms in those sums, incorporate some
coefficients from a small symbolic-algebra computation, and finally
output results.
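
A rough sketch of the shape of such a program, for readers who have not written one. This is not the Perl code described in this post: Python is swapped in purely for illustration, and the "key value" file format, the function names sum_matching_entries, fit_line and demo, and the straight-line fit model are all assumptions. The real fits estimate additional terms in the sums, and their form is not given here.

```python
import io
from collections import defaultdict


def sum_matching_entries(streams):
    """Sum values that share a key across several 'key value' text streams."""
    totals = defaultdict(float)
    for stream in streams:
        for line in stream:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            key, value = line.split()
            totals[int(key)] += float(value)
    return dict(totals)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def demo():
    """Two tiny in-memory 'files' stand in for the real data files."""
    files = [
        io.StringIO("1 0.5\n2 1.5\n3 2.5\n"),
        io.StringIO("# second run\n1 0.5\n2 0.5\n3 0.5\n"),
    ]
    totals = sum_matching_entries(files)
    keys = sorted(totals)
    return totals, fit_line(keys, [totals[k] for k in keys])
```

In this toy input the summed entries are 1.0, 2.0 and 3.0 for keys 1, 2 and 3, so the fitted line has intercept 0 and slope 1. The whole task is a few dozen lines of bookkeeping, which is the kind of work where a high-level language tends to pay off.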

I wrote a predecessor code in C++, calling Fortran or C libraries for
the least-squares fitting. For this project I wrote this code in Perl
(calling a C library for least-squares fitting). Given this experience,
I'm quite confident that the Perl version is preferable to either of
the C++ versions: it's "fast enough" (it typically runs in ~ 30 seconds
on my laptop) while being considerably easier to maintain/modify/enhance
than the C++ versions. I think this is an excellent illustration of
Wilson et al's suggestion
# Since faster, lower level, languages require more lines of code to
# accomplish the same task, scientists should write code in the highest-
# level language possible, and shift to low-level languages like C and
# Fortran only when they are sure the performance boost is needed.

> IMO you should write your scientific program directly in
> Fortran, Visual Basic or Pascal because these languages
> require you to give enough detail.

I'd like to call everyone's attention to a classic Bell Labs tech
report arguing against the use of Pascal for "serious programming":

Brian W Kernighan
"Why Pascal is Not My Favorite Programming Language"
April 2, 1981
currently at
http://www.lysator.liu.se/c/bwk-on-pascal.html
(and many other places too)

--
-- "Jonathan Thornburg [remove -animal to reply]" <jth...@astro.indiana-zebra.edu>
Dept of Astronomy & IUCSS, Indiana University, Bloomington, Indiana, USA

on sabbatical in Canada through late August 2013
"There was of course no way of knowing whether you were being watched
at any given moment. How often, or on what system, the Thought Police
plugged in on any individual wire was guesswork. It was even conceivable
that they watched everybody all the time." -- George Orwell, "1984"

Jos Bergervoet

Jul 28, 2013, 5:27:50 AM

On 7/26/2013 7:39 PM, Jonathan Thornburg [remove -animal to reply] wrote:
> Nicolaas Vroom<nicolaa...@pandora.be> wrote:
> [[about
>> G. Wilson et al,
>> "Best Practices for Scientific Computing"
>> http://arxiv.org/abs/1210.0530

...
...

> (b) is a "traditional number-crunching" code.

Since the dispute is about optimization this seems
the category to look at.

> Some of my colleagues
> use C for this sort of code, some use Fortran 90/95/2003/2008, and I
> use C++.

They share the fact that they are standardized (by ISO,
IEC or similar) and all three are part of the omnipresent
free-software GCC package. (So the choice is down to three.)

> Since these codes are already "painfully slow" (running times
> of up to a month), I don't see rewriting in higher-level scripting
> languages as reasonable for this part of the computation.

In addition, a high level of abstraction is already
available in two of the three choices (C++ and Fortran 90 and later)
for those who want it. Preferring scripting languages for
more structured programming seems a thing of the past.
Nowadays, their level is not really much higher any more.
So you're right, we can forget about them here.

> The tradeoffs between different languages for this sort of
> number-crunching are interesting,

Yes! Because the arguments are completely scientific:

1) C requires about 3 times more lines of code because
(of the three choices) it is the only one lacking almost
all high-level features (but it's easier to learn!)
2) Fortran can be inherently faster than C++ because it
ignores the possibility of pointer aliasing (leaving it
as a burden to the user to understand the implications).
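
To make the aliasing point concrete, here is a small sketch of the semantics only. Python is used just to show the behaviour and says nothing about speed, and the function names add_twice and add_twice_cached are made up for this illustration. The first function re-reads its source every time, which is what a C or C++ compiler generally has to do when it cannot prove that two pointers refer to different memory. The second reads the source once, which is the shortcut a Fortran compiler may take because the language forbids modifying aliased dummy arguments.

```python
def add_twice(dst, src):
    """Add src[0] to dst[0] twice, re-reading src[0] each time."""
    dst[0] += src[0]
    dst[0] += src[0]


def add_twice_cached(dst, src):
    """Same intent, but src[0] is read only once and kept in a local."""
    s = src[0]
    dst[0] += s
    dst[0] += s


def run(func, aliased):
    """Call func on fresh one-element lists; alias the arguments if asked."""
    dst = [1]
    src = dst if aliased else [1]
    func(dst, src)
    return dst[0]
```

With separate arguments both versions give 3. When the two arguments are the same object, the re-reading version gives 4 and the cached version gives 3, so the shortcut is only valid if aliasing is ruled out. That is the burden on the user mentioned in point 2): a Fortran program that does pass aliased arguments and modifies one of them is simply not standard-conforming.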

> ... but probably outside this
> newsgroup's scope.

Probably. For explanation of the (tricky) point 2) see:
http://stackoverflow.com/questions/146159/is-fortran-faster-than-c

Anyhow, if you know your objective the case seems in fact
completely settled after our analysis!

> I'd like to call everyone's attention to a classic Bell Labs tech
> report arguing against the use of Pascal for "serious programming":

Exactly, leave that one out, and others too!
"Three choices ought to be enough for anyone.."

--
Jos

Lester Welch

Aug 1, 2013, 3:30:58 AM

> "Best Practices for Scientific Computing"
> arXiv:1210.0530

I'm a retired physicist whose career was writing data acquisition
software for accelerators, which had to be very fast and configurable
to accommodate a large user community. The heart of the program was
written in assembly language, and in some cases the code was
self-modifying to avoid the CPU time consumed by "if" statements.
Debugging was a joy. In 1988 I wrote a paper, "Programming style: An
example" (Computers in Physics, Sep/Oct 1988, p 65; I couldn't find an
electronic version), which addressed a few of the issues of software
engineering before its time.

[Moderator's note: Any future posts in this thread should be
sufficiently relevant to physics. -P.H.]