Suggestion: Two levels of tests

Jakob Schiøtz

Apr 12, 2012, 8:44:31 AM

to ope...@googlegroups.com

Hi OpenKIM folks!

I would like to make a suggestion, running the risk of suggesting the obvious and something that is already planned :-)

As I understand it, tests in OpenKIM play the role of testing the quality of the models by calculating quantities that can then be compared against experiments or quantum calculations. Just as there are model drivers and models, and most models will probably be implemented through model drivers, I understood that there will be a similar two-level structure for tests (simulators and tests?), so that all the nitty-gritty of making neighbor lists etc. can be taken care of once and for all, and writing the tests will be relatively simple.

I would like to suggest extending this to a three-level structure: simulators, tests and code-tests. The purpose of a code-test is to check the *implementation* of the models, the tests and the simulators, and in particular to catch when something unexpectedly stops working due to apparently unrelated changes.

A code-test could be a simple way of specifying a model, a test, an expected result and a tolerance. All code-tests should then be run automatically at regular intervals, and a red flag raised if a result falls outside the expected tolerance. Preferably, a contributor should be able to run all code-tests relating to a given model, model driver, test or simulator prior to submitting a new version.

This could probably be implemented with code-tests as simple text files, and a job running all the tests based on these files. But it would require that tests have a standard way of getting their input (the model) and presenting their result. I guess such a standard will be needed anyway for the OpenKIM web infrastructure.
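
A minimal sketch of what such text-file code-tests and the job running them might look like, in Python. The file format (one "model test expected tolerance" record per line), the names ModelA, LatticeConstant and CohesiveEnergy, the numbers, and the run_test callable are all hypothetical; none of this is an existing OpenKIM interface.

```python
from typing import Callable, List, NamedTuple


class CodeTest(NamedTuple):
    model: str
    test: str
    expected: float
    tolerance: float


def parse_code_tests(text: str) -> List[CodeTest]:
    """Parse records of the form: model test expected tolerance.

    Everything after '#' on a line is treated as a comment.
    """
    cases = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        model, test, expected, tolerance = line.split()
        cases.append(CodeTest(model, test, float(expected), float(tolerance)))
    return cases


def run_code_tests(cases: List[CodeTest],
                   run_test: Callable[[str, str], float]) -> List[str]:
    """Return one red-flag message per case whose result is outside tolerance."""
    flags = []
    for case in cases:
        value = run_test(case.model, case.test)
        if abs(value - case.expected) > case.tolerance:
            flags.append(
                f"RED FLAG {case.model}/{case.test}: got {value}, "
                f"expected {case.expected} +/- {case.tolerance}"
            )
    return flags


# Hypothetical stand-in for "run this test with this model" (made-up numbers).
def fake_run_test(model: str, test: str) -> float:
    results = {
        ("ModelA", "LatticeConstant"): 4.05,
        ("ModelA", "CohesiveEnergy"): -3.30,
    }
    return results[(model, test)]


SPEC = """
# model   test             expected  tolerance
ModelA    LatticeConstant  4.05      0.01
ModelA    CohesiveEnergy   -3.36     0.01
"""

flags = run_code_tests(parse_code_tests(SPEC), fake_run_test)
for message in flags:
    print(message)
```

The runner only needs each test to accept a model and return a number, which is the "standard way of getting input and presenting results" mentioned above.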

Best regards

Jakob

--
Jakob Schiøtz, Ph.D.
Associate professor (lektor)
Program coordinator, M.Sc. in Physics and Nanotechnology
(kandidatstudieleder, Fysik og Nanoteknologi)
CINF, Department of Physics
Technical University of Denmark
DK-2800 Kongens Lyngby, Denmark
http://www.cinf.dtu.dk/~schiotz/

James Sethna

Apr 12, 2012, 8:45:44 PM

to ope...@googlegroups.com, Ellad Tadmor, Ryan S. Elliott

Jakob:

We've been planning several code-tests: for example, tests to check whether the potential forces are the derivatives of the energies. Is there a reason to distinguish them from regular tests? I even think it might be good to store their results in the database...

Jim

--
Jim Sethna, set...@lassp.cornell.edu, (607) 255-5132, FAX 6428
http://www.lassp.cornell.edu/sethna/sethna.html
Statistical Mechanics: Entropy, Order Parameters and Complexity
Available now at Amazon and Barnes and Noble, or direct from
Oxford Univ. Press: http://www.physics.cornell.edu/sethna/StatMech

Ryan S. Elliott

Apr 12, 2012, 10:56:06 PM

to ope...@googlegroups.com

Hi everyone,

Jim is right, we have already been planning to have things like the
derivatives check. As Jim says, there is not necessarily a reason to
distinguish these from regular Tests. However, we are planning to do so;
at least in some situations. We have been calling these "verification
checks". The reason for making this distinction is that we are planning
to have these be checks that are run and must be passed (or else have a
good reason for not passing) before a Model is officially accepted into
the openKIM system...

Notice, however, that all this does not actually directly address Jakob's
suggestion. Jakob was (I think) actually suggesting a different kind of
check: one that can apply to Tests as well as Models (unlike the
verification checks discussed above).

This seems like a good idea. In fact, let me modify Jakob's suggestion a
bit. It seems to me that the idea is to have a way to notice when the
results of a calculation have changed. That way, if they should not have
changed, one can identify a previously hidden bug and start to track it
down.

I think what this boils down to is the following: We would like a way to
take two sets of results/output (Predictions) from a Test (or Verification
Check) and compute a real number that represents the "sameness" of these
two results. We would also like to have a tolerance that can be used to
define "equal".

It seems to me that there is no hope (or at least very little hope) that
this can be defined in general. So, I think this would best be something
that is defined on a Test/Verification by Test/Verification basis. So, we
could encourage Test/Verification writers to also provide a script of
some sort that takes in two sets of results and prints their sameness
value. The same Test/Verification writer should also provide the
tolerance value that can be used to make an equal/not equal determination.
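
A minimal sketch of such a per-Test script, in Python. It assumes (this
is not from the thread) that a result set is a dictionary of named
numbers, and it takes the "sameness" value to be the largest relative
deviation, so 0.0 means identical. The names sameness, equal and
TOLERANCE and all the numbers are hypothetical.

```python
import math
from typing import Dict


def sameness(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Largest relative deviation between two result sets (0.0 = identical).

    Returns infinity if the two sets do not report the same quantities.
    """
    if a.keys() != b.keys():
        return math.inf
    worst = 0.0
    for key in a:
        scale = max(abs(a[key]), abs(b[key]))
        if scale > 0.0:
            worst = max(worst, abs(a[key] - b[key]) / scale)
    return worst


# Tolerance chosen by the (hypothetical) Test writer.
TOLERANCE = 1e-6


def equal(a: Dict[str, float], b: Dict[str, float],
          tolerance: float = TOLERANCE) -> bool:
    """Equal/not-equal decision based on the sameness value."""
    return sameness(a, b) <= tolerance


# Made-up result sets from three runs of the same Test/Model pairing.
old = {"a0": 4.0500000, "E_coh": -3.360000}
new = {"a0": 4.0500001, "E_coh": -3.360000}
changed = {"a0": 4.0600000, "E_coh": -3.360000}

print(sameness(old, new), equal(old, new))
print(sameness(old, changed), equal(old, changed))
```

Because the measure and the tolerance live with the Test, a Test that
reports, say, a curve instead of a few scalars can supply a different
measure without any change to the surrounding infrastructure.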

This can certainly be used to check when the result of a single Test/Model
pairing changes, which would indicate that either there is a bug or that
some intentional update has been made to either the Test or the Model.
But it could also be used to identify different Models that produce the
same result for a given Test.

Anyway, this would seem to eliminate the need for defining a standard way
for Tests to communicate. (Although, such a standard may, ultimately, be
necessary for the openKIM system, anyway...)

Your thoughts are welcome...

Cheers,

Ryan

On Thu, 12 Apr 2012, James Sethna wrote:

> Jakob:
>
> We've been planning several code-tests: for example, tests to check whether
> the potential forces are the derivatives of the energies. Is there a reason
> to distinguish them from regular tests? I even think it might be good to
> store their results in the database...
>
> Jim
>

>> Jakob Schiøtz, Ph.D.

>> Associate professor (lektor)
>> Program coordinator, M.Sc. in Physics and Nanotechnology
>> (kandidatstudieleder, Fysik og Nanoteknologi)
>> CINF, Department of Physics
>> Technical University of Denmark
>> DK-2800 Kongens Lyngby, Denmark
>> http://www.cinf.dtu.dk/~schiotz/
>>
>>
>
>

--
Ryan S. Elliott, Ph.D. and Associate Professor
Russell J. Penrose Faculty Fellow
Aerospace Engineering & Mechanics, University of Minnesota
(612) 624-2376 (626-1558 fax)
http://www.aem.umn.edu/~elliott/
download vCard <http://www.aem.umn.edu/~elliott/relliott.vcf>

KIM Editor (http://openKIM.org)
MSI Associate Fellow (http://msi.umn.edu)

CMT textbook webpage (http://modelingmaterials.org)
----------
The whole problem with the world is that fools and fanatics are always so
certain of themselves, but wiser people so full of doubts.

Bertrand Russell
----------

Jakob Schiøtz

Apr 13, 2012, 8:43:43 AM

to ope...@googlegroups.com, James Sethna, Ellad Tadmor, Ryan S. Elliott

Dear Jim and Ryan,

Thanks for your replies!

Jim: I think that there *is* a need to distinguish these two kinds of tests, as they answer two completely different questions. The tests in the database answer the question "Is this model a good model for calculating that quantity?". The code-checks (unit tests) answer the question "Is the code buggy?". The only reason to bunch them together is that the existence of the former kind of test makes the latter kind easy, bordering on trivial, to implement.

It is clear that consistency checks like numerical derivatives are one part of this code testing, and should be performed for all models. But it is easy to make bugs that produce wrong but consistent energies and forces. I actually managed to demonstrate that more than once during the workshop! :-) In particular, most errors in the neighbor list generation or handling will produce self-consistent errors.
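
A toy sketch in Python of how this can happen (not OpenKIM code): a 1D chain with a harmonic pair potential, where a hypothetical bug shrinks the neighbor cutoff. Energy and forces are computed from the same faulty neighbor list, so the finite-difference derivative check still passes, while a comparison against a stored reference energy catches the error. The potential, the positions, the cutoffs and all function names are assumptions made for illustration.

```python
def neighbor_pairs(x, cutoff):
    """All pairs (i, j), i < j, closer than the cutoff (1D positions)."""
    n = len(x)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(x[j] - x[i]) < cutoff]


def energy(x, cutoff):
    """Toy pair energy: sum of (r - 1)^2 over all neighbor pairs."""
    return sum((abs(x[j] - x[i]) - 1.0) ** 2
               for i, j in neighbor_pairs(x, cutoff))


def forces(x, cutoff):
    """Analytic forces F_k = -dE/dx_k, built from the same neighbor list."""
    f = [0.0] * len(x)
    for i, j in neighbor_pairs(x, cutoff):
        d = x[j] - x[i]
        dE_dr = 2.0 * (abs(d) - 1.0)
        sign = 1.0 if d > 0 else -1.0  # dr/dx_j = sign, dr/dx_i = -sign
        f[j] -= dE_dr * sign
        f[i] += dE_dr * sign
    return f


def derivative_check(x, cutoff, h=1e-5, tol=1e-6):
    """Compare analytic forces with central finite differences of the energy."""
    f = forces(x, cutoff)
    for k in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[k] += h
        xm[k] -= h
        numeric = -(energy(xp, cutoff) - energy(xm, cutoff)) / (2.0 * h)
        if abs(numeric - f[k]) > tol:
            return False
    return True


positions = [0.0, 1.1, 2.3]
CORRECT_CUTOFF = 2.5
BUGGY_CUTOFF = 1.5  # hypothetical bug: silently drops the (0, 2) pair

# Reference value, stored while the code was known to be good.
reference = energy(positions, CORRECT_CUTOFF)
buggy = energy(positions, BUGGY_CUTOFF)

print("derivative check, correct code:", derivative_check(positions, CORRECT_CUTOFF))
print("derivative check, buggy code:  ", derivative_check(positions, BUGGY_CUTOFF))
print("energy matches reference:      ", abs(buggy - reference) < 1e-8)
```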

The developer of a model will be able to use existing tests to generate code-tests for his model, and vice versa for the test developers. The existence of such a test suite is essential for reliable code. We have used such a suite extensively in, for example, our GPAW DFT package, where a test suite is run every night on the current SVN version. Hardly a week passes without something coming up, typically unexpected side effects of changes or bug fixes! Developers also run these tests on their own machines, sometimes finding architecture-dependent issues. In fact, these unit tests have become so important that there are even coverage tests checking that every line of code is executed at least once in the unit tests (or that somebody understands why it is not necessary to execute that line). Coverage tests in a multi-language environment will be a challenge, however (to put it mildly).

The vast majority of the GPAW tests are actually calculations of real values, and writing them requires some effort. The OpenKIM infrastructure should, on the other hand, make it virtually trivial to write the vast majority of these, as it will just be a question of checking that the value of some already existing calculation is the same as it used to be. It might not even be necessary to do anything: one could in principle recalculate the whole KIM database, look for changed values, and then raise a red flag. Unfortunately, "changed" is not well-defined.

Since the KIM infrastructure will anyway need some standardized way of collecting values from tests, why not use it to make it easy to make such tests? This is not urgent, though :-)

James Sethna

Apr 13, 2012, 9:06:01 AM

to ope...@googlegroups.com, Ellad Tadmor, Ryan S. Elliott

Jakob:

That sounds fantastic. I think we all agree that code-tests like this
are crucial, although your experience suggests that they can be much
more important than I had realized. (How do you write a "coverage
test"?) I would support a separate name for them, even if the
interface and functionality are similar, just to emphasize that we
agree they are crucial for a professional code. (Part of our mission
is to set standards for good potential programming practice, after
all.) What does everyone think?

Jim

Jakob Schiøtz

Apr 13, 2012, 9:26:37 AM

to ope...@googlegroups.com, James Sethna

On 13 Apr, 2012, at 15:06, James Sethna wrote:

> (How do you write a "coverage
> test"?)

That is not a trivial task, and I have no idea how to do it.

GPAW actually only does coverage testing for the part implemented in Python. There are preexisting tools for that; the most widely used is apparently coverage.py (see http://nedbatchelder.com/code/coverage/ ). I think it is also possible to use various kinds of profiling tools for coverage testing of compiled code, but I have no experience with that.
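
For the Python case, the principle behind line coverage can be sketched with the standard-library hook sys.settrace. This is a toy illustration only, not a substitute for coverage.py and not a description of how GPAW runs its coverage tests; the helper lines_executed and the example function classify are hypothetical.

```python
import sys


def lines_executed(func, *args):
    """Run func(*args) and return the set of executed line offsets.

    Offsets are counted from the 'def' line of func; only lines inside
    func itself are recorded, not lines of functions it calls.
    """
    code = func.__code__
    executed = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            executed.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return executed


def classify(x):
    if x > 0:
        return "positive"
    return "non-positive"


# A single test input leaves one return statement unexecuted ...
covered = lines_executed(classify, 1)
# ... and a second input closes the gap.
covered_all = covered | lines_executed(classify, -1)

print(sorted(covered))
print(sorted(covered_all))
```

A coverage test in the sense described above would then compare the executed lines against all executable lines and flag any that no unit test reaches.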
