best use of our testing time?


Bob Carpenter

Jan 1, 2014, 4:47:41 PM
to stan...@googlegroups.com
Unit testing gives me a lot of time for reflection.


Summary
-----------------------------------

I feel we're missing the forest for the trees in testing.

Given our limited time budget, I think it would be better spent
with more functional specification and testing and less time
unit testing.


Example of Unit Testing Pain
------------------------------------

To plug the memory leak in softmax, I:

1. first verified there was a leak by writing a test and then watching
the process in Activity Monitor (the Mac process monitor).

2. fixed the leak by replacing "operator new" with "memalloc_.alloc".

3. verified there was no longer a leak using Activity Monitor.

Now the fun begins.

4. I wrote an e-mail to the mailing list asking how to test for a memory leak.
As of 2.1.0, there was no testing at all that we aren't leaking memory.

5. Ben suggested a Windows-incompatible test on memory, and given that I
don't know how to do the same on Windows, I didn't look further.

6. Daniel suggested monitoring stack size, but the problem was that
the allocation wasn't happening on the stack, just on the heap, so
I couldn't use a simple stack size test.

7. Instead, I wrote a new function in stack_alloc that monitors the
amount of memory allocated to Stan through the memalloc_ instance. But
that doesn't line up exactly with the memory requested through
stack_alloc::alloc due to unused bytes at the end of blocks.

8. Then I had to go write a test for stack_alloc::bytes_allocated. I had initially
wanted to track memory allocated from memalloc_, but that turns out
to be impossible given that we don't fill blocks or keep track of where
the memory went (nor do we need to!).

9. Next, I went back and wrote a test for softmax that failed in the
old version and succeeded in the new version by checking the bytes
allocated in our current code (a sketch of its shape follows this list).
The key here is that the memory should've been allocated on this stack,
not the general C++ heap. This test is extremely brittle: it depends on
current data structure sizes and block size constants, and it will have
to be refiddled if we change the constants in stack_alloc or the data
structures in softmax. (This is an example of the kind of problem Michael
anticipated and that we've already run into with the command-line tests.)
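
For concreteness, the shape of the test is roughly as follows. This is a
sketch, not the code in the branch; the header path, the stan::agrad
namespaces, and the var arithmetic standing in for the softmax call are
from memory and would need checking.

#include <gtest/gtest.h>
#include <stan/agrad/rev.hpp>  // header path from memory

// Idea: whatever the function under test allocates should show up in the
// arena (memalloc_), not on the general C++ heap.
TEST(AgradRev, allocatesOnArenaNotHeap) {
  using stan::agrad::var;
  size_t bytes_before = stan::agrad::memalloc_.bytes_allocated();

  var x = 2.0;
  var y = x * x;  // stand-in for the softmax call in the real test
  (void) y;

  size_t bytes_after = stan::agrad::memalloc_.bytes_allocated();

  // The real test pins (bytes_after - bytes_before) to exact byte counts,
  // which is exactly what makes it brittle to block-size and data-structure
  // changes; this sketch only checks that the arena grew at all.
  EXPECT_LT(bytes_before, bytes_after);
}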


What I Should Have Done
-------------------------------------

To be properly modular, what I should've done was write a new
feature request for the bytes_allocated function, specify it, and test it,
and only then write the unit test for the softmax leak that uses it.

Instead, I sloppily introduced a new public function in an unrelated part of
the memory allocation code simply to support tests.

At least I put it in a different commit, making sure all of the commits
passed all of our unit tests (a huge time sink for the computer, if not
for me).


Was it Worth it?
-----------------------------------

All of this testing took much, much longer than the fix itself, involved the
attention of at least three of our developers before the fix, and will require
at least one more developer's time for code review.
I think the fix took about five minutes and the unit testing about two hours,
including writing the new functions.

The end result is very brittle, and I'm not sure it's telling us anything
that looking at the activity monitor didn't tell me (other than that it's
automated, but as I mentioned above, that's likely to bite us later due to
its brittleness).

In this case, I think we could've just code reviewed the change and
gone with it. But this brings up the next issue.


Subjective Nature of When Enough is Enough
------------------------------------------------

Another problem I see is that how much testing to do is a judgment call.
I may trust myself and code reviewers in situations like this one that
seem totally obvious to me. But are they totally obvious? It's very subjective,
and my guess is that we're all overconfident in our fixes, exactly the
problem that unit testing is trying to mitigate.

I'm not sure any of us understand Daniel's line between
enough and not-enough testing. (I'm only picking out Daniel because he's
been driving this process and we put him in charge of gatekeeping merges a
few meetings ago.)


Time Misallocated?
------------------------------------------------

What I'm really worried about is that we're misallocating
our development/testing time. Would those hours I spent adding the
new tests have been better spent doing more functional testing,
specifying interfaces for RStan/PyStan/CmdStan, or something along
those lines?

Sorry this isn't a more concrete suggestion on how to fix our process.

Maybe we can discuss at next meeting if we run out of statistical
topics.

- Bob

Ben Goodrich

Jan 1, 2014, 7:58:58 PM
to stan...@googlegroups.com
On Wednesday, January 1, 2014 4:47:41 PM UTC-5, Bob Carpenter wrote:
Unit testing gives me a lot of time for reflection.


Summary
-----------------------------------

I feel we're missing the forest for the trees in testing.

Given our limited time budget, I think it would be better spent
with more functional specification and testing and less time
unit testing.

Somewhat related to this, I think the benefit of the current function_signatures tests is not worth the time, electricity, wear on spinning hard disks, etc. I hate these tests more than I hate pennies, which I would probably value at no more than 1/10th of a cent.

I have a local branch that adds -fsyntax-only to the $(CC) call for (most of) the function_signatures tests, meaning that what is tested is whether stanc yields a .cpp file that is syntactically correct. Relative to the current behavior, it doesn't test whether the .cpp file actually compiles at whatever level of optimization. So while it is true that these tests might catch a rare compiler-related bug that appears only at some level of optimization, they are a horrifically inefficient way of catching such bugs. Most of the function_signatures tests call different template instantiations of the same set of functions, so the compiler is doing the same optimization passes over and over again. And most of the parser bugs we have encountered either would get caught under -fsyntax-only or are run-time things that aren't getting caught by the existing behavior anyway.

If you do

git checkout develop
make clean-all
make CC=clang++ -j9 bin/stanc
time make CC=clang++ -j4 test/gm/compile_models > /dev/null

you spend

real    42m33.405s
user    164m32.483s
sys     1m15.659s

on these tests if you don't swap, which is why I used clang++ and -j4 (my laptop has 8GB of RAM and uses 5 to 6 in this case). If you use g++, you tend to swap more. If you use mingw, you tend to swap more. If you use a recent version of mingw, instead of the 4.6.x version that comes with Rtools, you tend to swap a lot more (which is the only reason I can think of why Rtools uses such an old version of mingw). If you use -j8 with clang++ 3.4, you still swap.

With my branch that does -fsyntax-only for most of the function_signatures tests, when doing the above you spend

real    2m24.641s
user    8m38.731s
sys     0m7.217s

and can use different compilers and more virtual cores if you want. People have stated that they haven't run test-unit locally because of swapping or they specify O=0 to avoid optimization, so we would probably catch more optimization-related bugs if we did -fsyntax-only for most of the function_signatures tests and ran test-unit more often.
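
To be concrete, for each generated file the check amounts to something like
this (include paths illustrative, not the actual makefile rule):

clang++ -fsyntax-only -I src -I lib some_generated_model.cpp

which parses and type-checks the translation unit but skips code generation
and optimization entirely.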

Some of these tests actually execute something, so I left a few of them alone, namely

goodrich@CYBERPOWERPC:/opt/stan$ ls src/test/gm/model_specs/compiled/*.stan
src/test/gm/model_specs/compiled/assignment_double_index_lhs.stan   src/test/gm/model_specs/compiled/illegal_transformed_parameters.stan
src/test/gm/model_specs/compiled/assignments_double_var.stan        src/test/gm/model_specs/compiled/issue109.stan
src/test/gm/model_specs/compiled/assignments.stan                   src/test/gm/model_specs/compiled/issue91.stan
src/test/gm/model_specs/compiled/assignments_var.stan               src/test/gm/model_specs/compiled/mat_assign.stan
src/test/gm/model_specs/compiled/declarations.stan                  src/test/gm/model_specs/compiled/print_chars.stan
src/test/gm/model_specs/compiled/dims.stan                          src/test/gm/model_specs/compiled/print_indexing.stan
src/test/gm/model_specs/compiled/domain_fail.stan                   src/test/gm/model_specs/compiled/rngs.stan
src/test/gm/model_specs/compiled/illegal_generated_quantities.stan  src/test/gm/model_specs/compiled/value_fail.stan
src/test/gm/model_specs/compiled/illegal_transformed_data.stan

I moved the rest to a function_signatures subdirectory that gets treated differently.

Ben

Bob Carpenter

Jan 1, 2014, 8:13:27 PM
to stan...@googlegroups.com


On 1/2/14, 1:58 AM, Ben Goodrich wrote:
> On Wednesday, January 1, 2014 4:47:41 PM UTC-5, Bob Carpenter wrote:
>
> Unit testing gives me a lot of time for reflection.
>
>
> Summary
> -----------------------------------
>
> I feel we're missing the forest for the trees in testing.
>
> Given our limited time budget, I think it would be better spent
> with more functional specification and testing and less time
> unit testing.
>
>
> Somewhat related to this, I think the benefit of the current function_signatures tests is not worth the time,
> electricity, wear on spinning hard disks, etc. I hate these tests more than I hate pennies, which I would probably value
> at no more than 1/10th of a cent.

Being a fellow hater of pennies, I love that expression!
I'm still on the fence with the 1 euro and 2 euro coins.

Speaking of electricity, these tests drain my battery so
quickly I can almost see it ticking down. And otherwise
these Macbooks run a long time on a charge.

The reason the function_signature tests are
there is that we had problems in the past with people writing functions
that compiled under some template instantiations but not
others, and these got through to release versions.

They also take some time to write for each of our functions,
as there are a lot of declarations and whatnot to cover all the
cases.

They're going to be really really slow once we try
instantiating them for higher-order auto-diff. Then we'll
have to do it only rarely if we try to do it at all.

> I have a local branch that adds -fsyntax-only to the $(CC) call for (most of) the function_signatures tests, meaning
> that what is tested is whether stanc yields a .cpp file that is syntactically correct. Relative to the current behavior,
> it doesn't test whether the .cpp function actually compiles at whatever level of optimization.

Does this test that the instantiations are correct or just
the templates? If the former, I think we should move to your
suggested form. If not, maybe we should break them out and
only run them for integration.

> Thus, it is true that
> these tests might catch a rare compiler-related bug that appears at some level of optimization, it is a horrifically
> inefficient way of catching such bugs. Most of the function_signatures tests call different template instantiations of
> the same set of functions, so the compiler is doing the same optimization passes over and over again.

The point was to test the instantiations could be compiled.

> And most of the
> parser bugs we have encountered either would get caught under -fsyntax-only or are run-time things that aren't getting
> caught by the existing behavior.

Right --- these do nothing at all to test that they
can actually be executed sensibly --- just that all the
instantiations compile.

> If you do
>
> |
> git checkout develop
> make clean-all
> make CC=clang++ -j9 bin/stanc
> time make CC=clang++ -j4 test/gm/compile_models > /dev/null
> |
>
> you spend
>
> |
> real 42m33.405s
> user 164m32.483s
> sys 1m15.659s
> |
>
> on these tests if you don't swap, which is why I used clang++ and -j4 (my laptop has 8GB of RAM and uses 5 to 6 in this
> case).

Yikes. I just run -O0 for the unit tests, and they run MUCH
faster there.

> If you use g++, you tend to swap more. If you use mingw, you tend to swap more. If you use a recent version of
> mingw, instead of the 4.6.x version that comes with Rtools, you tend to swap a lot more (which is the only reason I can
> think of why Rtools uses such an old version of mingw). If you use -j8 with clang++ 3.4, you still swap.

I only ever run -j4 because I only have 4 cores in my notebook.

> With my branch that does -fsyntax-only for most of the function_signatures tests, when doing the above you spend
>
> |
> real 2m24.641s
> user 8m38.731s
> sys 0m7.217s
> |
>
> and can use different compilers and more virtual cores if you want. People have stated that they haven't run test-unit
> locally because of swapping or they specify O=0 to avoid optimization, so we would probably catch more
> optimization-related bugs if we did -fsyntax-only for most of the function_signatures tests and ran test-unit more often.

I don't think we'll catch any optimization bugs with what we're doing
now, unless it's a compiler bug that lets it compile at -O0 and not
at -O3 or something like that.

> Some of these tests actually execute something, so I left a few of them alone, namely
>
> |
> goodrich@CYBERPOWERPC:/opt/stan$ ls src/test/gm/model_specs/compiled/*.stan
> src/test/gm/model_specs/compiled/assignment_double_index_lhs.stan
> src/test/gm/model_specs/compiled/illegal_transformed_parameters.stan
> src/test/gm/model_specs/compiled/assignments_double_var.stan src/test/gm/model_specs/compiled/issue109.stan
> src/test/gm/model_specs/compiled/assignments.stan src/test/gm/model_specs/compiled/issue91.stan
> src/test/gm/model_specs/compiled/assignments_var.stan src/test/gm/model_specs/compiled/mat_assign.stan
> src/test/gm/model_specs/compiled/declarations.stan src/test/gm/model_specs/compiled/print_chars.stan
> src/test/gm/model_specs/compiled/dims.stan src/test/gm/model_specs/compiled/print_indexing.stan
> src/test/gm/model_specs/compiled/domain_fail.stan src/test/gm/model_specs/compiled/rngs.stan
> src/test/gm/model_specs/compiled/illegal_generated_quantities.stan src/test/gm/model_specs/compiled/value_fail.stan
> src/test/gm/model_specs/compiled/illegal_transformed_data.stan
> |
>
> I moved the rest to a function_signatures subdirectory that gets treated differently.

That seems like a reasonable approach.

- Bob

Bob Carpenter

Jan 1, 2014, 8:41:10 PM
to stan...@googlegroups.com


On 1/2/14, 1:58 AM, Ben Goodrich wrote:
> On Wednesday, January 1, 2014 4:47:41 PM UTC-5, Bob Carpenter wrote:
>
> Unit testing gives me a lot of time for reflection.
>
>
> Summary
> -----------------------------------
>
> I feel we're missing the forest for the trees in testing.
>
> Given our limited time budget, I think it would be better spent
> with more functional specification and testing and less time
> unit testing.
>
>
> Somewhat related to this, I think the benefit of the current function_signatures tests is not worth the time,
> electricity, wear on spinning hard disks, etc. I hate these tests more than I hate pennies, which I would probably value
> at no more than 1/10th of a cent.

I take it back. Even with CC=clang++ O=0, I got 26 minutes in and it's
still running. I killed it.

Maybe we already are doing higher-order signature tests.

- Bob

Ben Goodrich

Jan 1, 2014, 9:15:47 PM
to stan...@googlegroups.com
On Wednesday, January 1, 2014 8:13:27 PM UTC-5, Bob Carpenter wrote:
> I have a local branch that adds -fsyntax-only to the $(CC) call for (most of) the function_signatures tests, meaning
> that what is tested is whether stanc yields a .cpp file that is syntactically correct. Relative to the current behavior,
> it doesn't test whether the .cpp function actually compiles at whatever level of optimization.

Does this test that the instantiations are correct or just
the templates?  If the former, I think we should move to your
suggested form.  If not, maybe we should break them out and
only run them for integration.

I read this

http://stackoverflow.com/questions/2634986/g-fsyntax-only-unit-test

before I made the branch, so I think we will be fine.

Ben

Marcus Brubaker

Jan 2, 2014, 1:12:39 PM
to stan...@googlegroups.com
Bob and Ben have gone off on a tangent with this email, but I think it raises important questions so...


On Wed, Jan 1, 2014 at 4:47 PM, Bob Carpenter <ca...@alias-i.com> wrote:
Unit testing gives me a lot of time for reflection.


Summary
-----------------------------------

I feel we're missing the forest for the trees in testing.

Given our limited time budget, I think it would be better spent
with more functional specification and testing and less time
unit testing.


Perhaps it goes without saying but.... +1
Indeed, I worry about this a lot.  Some things are worth testing and can be tested well, but not everything.  While having more unit tests seems like it can only be a good thing, it potentially makes large-scale changes down the line impossible or incredibly time-consuming to do.

Plus, all this work on low-level unit testing hasn't caught the most important kinds of bugs because we haven't gotten around to writing more high-level tests (e.g., checking convergence or estimates from sampling).


Subjective Nature of When Enough is Enough
------------------------------------------------

Another problem I see is that how much testing to do is a judgment call.
I may trust myself and code reviewers in situations like this one that
seem totally obvious to me.  But are they totally obvious?  It's very subjective,
and my guess is that we're all overconfident in our fixes, exactly the
problem that unit testing is trying to mitigate.

I'm not sure any of us understand Daniel's line between
enough and not-enough testing.  (I'm only picking out Daniel because he's
been driving this process and we put him in charge of gatekeeping merges a
few meetings ago.)


I know I have struggled with this as well.  I think we need to come to a better consensus as a project about what "enough testing" means.

In this case, Bob ended up writing a unit test simply because the initial implementation had a bug.  I understand the need for and value of regression tests, but that's different from a unit test, and it's unlikely that this specific bug will ever resurface unless someone completely rewrites the code; if that happens, it will most likely break the test anyway.

I think testing needs to be prioritized based on what's most important for Stan as a project.  For instance, I would rank functionality/bugs based on their urgency in the following way:

1. Statistical correctness is most urgent.  Anything which risks invalidating the results of a sufficiently long Markov chain.  Examples include correctness of PDFs or errors in the sampler.
2. Statistical efficiency is second most urgent.  Anything which slows sampling, particularly to the extent that it may lead to a user misinterpreting convergence, therefore indirectly impacting correctness.  Examples of this include correctness of gradients, bugs in adaptation, etc.
3. Crashes, memory leaks, C++ interfaces and other bugs.  These things are clearly bad but if a user hits them, they generally know and there isn't a risk of incorrect inferences.
4. User-visible interfaces.  E.g., commands do what they're supposed to, input files are read properly, output files are in the right format, etc.

My view is that our test suite should be targeted towards 1 and 2 and, to a lesser extent, 4.  Bugs and whatnot that come out of 3 are important to fix, but extensive unit tests aren't needed, and regression tests should only be written for subtle or hard-to-notice issues.  Issues in 4 are important from a user-friendliness perspective, but I think for Stan priority #1 must be statistical correctness and efficiency, and our efforts need to be prioritized accordingly.

(On a side note, if we wanted to test issues like 3 well, we should simply run all our tests/models through valgrind/dmalloc/whatever once in a while and ensure everything comes up clean.)
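
For example, something along these lines, where the model binary and its
arguments are just placeholders:

valgrind --leak-check=full --error-exitcode=1 \
  ./my_model sample num_samples=100 data file=my_data.data.R

The --error-exitcode flag makes a scripted run fail automatically whenever
valgrind reports errors.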

 


Time Misallocated?
------------------------------------------------

What I'm really worried about is that we're misallocating
our development/testing time.   Would those hours I spent adding the
new tests have been better spent doing more functional testing,
specifying interfaces for RStan/PyStan/CmdStan, or something along
those lines?

Sorry this isn't a more concrete suggestion on how to fix our process.

Maybe we can discuss at next meeting if we run out of statistical
topics.

Personally I'd rather we spend some of the effort we direct towards low-level unit tests on high-level behaviour tests.  Those things matter a lot more and, if they were done, would allow us to work with less testing and more confidence elsewhere.

Cheers,
Marcus

 

Michael Betancourt

Jan 2, 2014, 2:48:08 PM
to stan...@googlegroups.com
I'm going to have to cast a vote in favor of our push for more unit tests.

Firstly, "more functionality-less tests" is really a false option. Without tests
bugs sneak through, causing flawed releases that require a huge amount
of developer time to investigate, fix, and then update. Not to mention the
potential breakdown of trust from our users. I say this from experience,
loosing entire weekends tracking down bugs.

Secondly, unit tests offer far more than explicit tests. They also offer an
example for how objects should be instantiated and used. In our
"deprecated comments is a greater sin than no comments" philosophy
these examples are critical to understanding how someone else's code
is supposed to work and hence how it can be integrated/extended. It seems
like half of our debugging time is spent just trying to deduce another developer's
intentions.

Thirdly, higher-level tests are extremely limited. They can't identify problems
beyond the tested models (and there's no way we'd be able to fully test the space
of all models), and even when a problem is identified we have no guidance on
where the problem is occurring. Yes, a bisect might be able to identify the
offending commit, but it wouldn't identify interactions between the new and old
code that induce the problems.

tl;dr Unit tests not only identify bugs; they make debugging easier and the
integration of new code far less difficult. This will not change with the addition of
higher-level integration tests or statistical tests.

Bob Carpenter

Jan 2, 2014, 5:47:13 PM
to stan...@googlegroups.com
And therein lies the problem. We're trying to balance two
good things against not having enough time to do everything!

More inline below.

On 1/2/14, 8:48 PM, Michael Betancourt wrote:
> I'm going to have to cast a vote in favor of our push for more unit tests.
>
> Firstly, "more functionality-less tests" is really a false option. Without tests
> bugs sneak through, causing flawed releases that require a huge amount
> of developer time to investigate, fix, and then update. Not to mention the
> potential breakdown of trust from our users. I say this from experience,
> loosing entire weekends tracking down bugs.

I'm particularly worried about trust. Both in ourselves and
from our users.

The problem I see is that as we crawl up the functionality and modularity
hierarchy from individual mathematical functions defined in stan::math to
the command-line code, it gets harder and harder to unit test. The complete
spec is less clearly defined, and testing little pieces requires extensive
mock-object frameworks. As far as I know, we don't have any tests that the
right samplers are getting called with the right command-line args, and such
tests are complex to write if we want to do it functionally, because we have
to know how each of the samplers behaves w.r.t. some fixed input.

> Secondly, unit tests offer far more than explicit tests. They also offer an
> example for how objects should be instantiated and used. In our
> "deprecated comments is a greater sin than no comments" philosophy
> these examples are critical to understanding how someone else's code
> is supposed to work and hence how it can be integrated/extended. It seems
> like half of our debugging time is spent just trying to deduce another developer's
> intentions.

If you are referring to my distaste for code comments, my preference is
for better structured code (i.e., more modular code) with reasonable naming conventions
so it's easier to follow. For instance, I'm in command.hpp right now, and there
are Fortran-style comments in functions like this:

//////////////////////////////////////////////////
// Initialize Model //
//////////////////////////////////////////////////

to point out that the following block of code initializes the model. I'd far
prefer a function named initialize_model(...) that encapsulated all of that code.

Functions should be short and scannable, not 700 lines of nested loops!

The problem (there's always a problem!) is that the first line is:

Model model(data_var_context, &std::cout);

and that variable "model" is going to be needed outside the loop. So if that
initialization is done internal the function, you have to define a nullary
constructor for Model and pass in a reference to model to set along with
the arguments. It can quickly get ugly. In this case, it makes sense to
define the model outside the initialization code, but I hope the general problem's
clear enough.
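
Roughly what I'd want it to look like, with made-up names and none of the
real command.hpp details:

#include <iostream>

// Illustrative sketch only, not the real code. The point is to pull the
// "Initialize Model" block into a short, named function.
template <class Model, class DataContext>
Model initialize_model(DataContext& data_var_context, std::ostream* out) {
  Model model(data_var_context, out);
  // ... random inits, checks, etc., would move here ...
  return model;  // needs a copy (or, with C++11, a move) constructor
}

// Call site in the refactored command():
//   Model model = initialize_model<Model>(data_var_context, &std::cout);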


Also, I believe very strongly in API comments that indicate what the inputs and
outputs of functions are and how the outputs relate to the inputs. The unit
tests can be used as stand-ins for the API comments, but I think the right way round
is to doc the API, then test vs. the doc. Including boundary conditions. But
this also gets much harder as we crawl up the functionality hierarchy.
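
Concretely, I mean plain doxygen along these lines (the wording and exact
signature are just a sketch, keyed to the bytes_allocated function from
earlier in this thread):

/**
 * Return the number of bytes allocated through this arena
 * allocator, counted at the block level, so it includes unused
 * bytes at the ends of blocks.
 *
 * @return Total bytes in the blocks handed out by this allocator.
 */
size_t bytes_allocated() const;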


Code comments should be reserved, in my opinion, for when you do something non-idiomatic
in the language or when there's a very complex algorithm the steps of which need
to be explained.


> Thirdly, higher-level tests are extremely limited. They can't identify problems
> beyond the tested models (and there's no way we'd be able to fully test the space
> of all models) and even when a problem is identified we have no guidance in
> where the problems are occurring. Yes, a bisect might be able to identify the
> offending commit but it wouldn't identify interactions between the new and old
> code that induce the problems.

Agreed. And unit tests are extremely limited in exactly the opposite way --- almost
by definition, they test only micro behavior, not macro behavior.

> tl;dr Unit tests not only identify bugs they make debugging easier and the
> integration of new code far less difficult. This will not change with the addition of
> higher-level integration tests or statistical tests.

Agreed. The only problem is that we don't seem to get around to writing
the higher-level tests because we're all (or at least I am) buried in the
details of unit testing.

In the end, I think it's a judgment call. I don't think we want to be dogmatic
about demanding tests for every single function at every level of abstraction.
We just don't have the programmer-hours to make progress this way.

Then again, maybe we don't have the programmer hours not to :-)

- Bob

Michael Betancourt

Jan 2, 2014, 9:09:58 PM
to stan...@googlegroups.com
A few quick points:

> The problem I see is that as we crawl up the functionality and modularity
> hierarchy from individual mathematical functions defined in stan::math to
> the command-line code, it gets harder and harder to unit test. The complete
> spec is less clearly defined, and testing little pieces requires extensive
> mock-object frameworks. As far as I know, we don't have any tests that the
> right samplers are getting called with the right command-line args, and such
> tests are complex to write if we want to do it functionally, because we have
> to know how each of the samplers behaves w.r.t. some fixed input.

This will definitely be improved when we refactor everything. Unit tests are hard
with poor specs, but that's a problem with the spec, not with unit tests. I'm with Dan
on formalizing a crappy spec to get a first round of tests and then building up from there.


> If you are referring to my distaste for code comments, my preference is
> for better structured code (i.e., more modular code) with reasonable naming conventions
> so it's easier to follow. For instance, I'm in command.hpp right now, and there
> are Fortran-style comments in functions like this:
>
> //////////////////////////////////////////////////
> // Initialize Model //
> //////////////////////////////////////////////////
>
> to point out that the following block of code initializes the model.

So I LOVE these comments, but only because they make the relevant part of the code
easy to find quickly. People will argue that command.hpp is way too long -- yeah,
it probably is -- but given the length, these kinds of comments are helpful.

Even trivial comments on top of functions are useful, as the syntax coloring in any decent
editor will make the transitions between functions pop visually. I still find the longer Stan
files hard to parse.

> Also, I believe very strongly in API comments that indicate what the inputs and
> outputs of functions are and how the outputs relate to the inputs. The unit
> tests can be used as stand-ins for the API comments, but I think the right way round
> is to doc the API, then test vs. the doc. Including boundary conditions. But
> this also gets much harder as we crawl up the functionality hierarchy.
>
> Agreed. And unit tests are extremely limited in exactly the opposite way --- almost
> by definition, they test only micro behavior, not macro behavior.
>
> Agreed. The only problem is that we don't seem to get around to writing
> the higher-level tests because we're all (or at least I am) buried in the
> details of unit testing.

We're in a hard place right now because we're catching up after a long time without
tests. It will continue to suck as we transition, but tests will get easier as we
formalize specs and continue the internal refactors, and eventually (sooner rather
than later, in my opinion) we'll be in a much better place where we can get back
to focusing on fun code. It will also be easier to bring in new developers and
external pull requests.

These are good conversations to be having!

Marcus Brubaker

Jan 3, 2014, 7:21:59 PM
to stan...@googlegroups.com
To be clear, I don't think we should do *less* testing.  I just think we need to balance where our time is spent, because I think everyone's experience recently has been that we've spent a *ton* of time working on tests relative to very little time working on either bug fixes or new features.  Perhaps part of the solution is that we need to be smarter about *how* we unit test things.  "Test smarter, not harder," to paraphrase...

Let me give an example of what I mean with finite-difference gradient testing.  Right now we have finite-difference tests sprinkled all over the place.  In all of them we re-implement the same tedious steps of computing the finite difference, computing the agrad version, and then comparing to some tolerance.  We clearly need a finite-difference framework such that these tests would be 3-5 lines of code, tops, instead of 20 to 30.  Moreover, we have a significant collection of models and a command-line tool which computes finite differences of the model log prob.  It should be a relatively small amount of work to actually check these as part of our unit tests.  If you just did that, without *any* specific function tests, we'd automatically get decent gradient test coverage.
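
Roughly what I have in mind, as a sketch only (the helper name and the idea
of passing the analytic gradient in as a functor are illustrative; in
practice the analytic side would come from agrad):

#include <gtest/gtest.h>
#include <vector>

// Sketch: a reusable finite-difference gradient check. F maps
// std::vector<double> -> double; G returns the analytic gradient at x.
template <typename F, typename G>
void expect_finite_diff_gradient(const F& f, const G& analytic_gradient,
                                 std::vector<double> x,
                                 double epsilon = 1e-6,
                                 double tolerance = 1e-4) {
  std::vector<double> grad = analytic_gradient(x);
  for (size_t i = 0; i < x.size(); ++i) {
    double x_i = x[i];
    x[i] = x_i + epsilon;
    double f_plus = f(x);
    x[i] = x_i - epsilon;
    double f_minus = f(x);
    x[i] = x_i;  // restore
    double fd = (f_plus - f_minus) / (2.0 * epsilon);
    EXPECT_NEAR(fd, grad[i], tolerance) << "gradient mismatch in dimension " << i;
  }
}

With something like that in place, a per-function test is one call to
expect_finite_diff_gradient plus a couple of small functors.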

I think similar arguments could be made with how we test, e.g., distribution log prob functions.

In short, right now part of why testing takes so much time is that each individual test takes a lot of work to write.  If we make it easier to write the tests, then hopefully we'll spend less time doing it.

Cheers,
Marcus




Bob Carpenter

Jan 3, 2014, 8:05:02 PM
to stan...@googlegroups.com
I'm not writing dumb tests because I think it's a good idea
to write dumb tests!

I've been meaning to write a finite-diff test framework,
so I just created an issue to do so:

https://github.com/stan-dev/stan/issues/483

I'm not sure it's going to save a lot of time given the need for
wrappers. Feel free to edit if you have better ideas of how to
do this in general.

An example of where I can't see how to test well is in the suggestion
to catch 5 different types of exceptions in 6 different catch
blocks (or to have a single block with a bunch of runtime tests for type).
I don't see how to exercise those 30 code blocks (or 6 blocks with 5 conditions
each) in any straightforward way. The last pull request I made was also malformed in
that I didn't even try to test. It seems like a minimum requirement of testing
that we get full code coverage. Am I overthinking this?

Same thing for all this memory leak stuff. If someone can suggest
a general way to test for memory leaks that doesn't depend on
the function being tested and its internal data sizes and that doesn't
take forever to run in order to uncover the leak, I'd be happy to
implement it.

- Bob

Marcus Brubaker

Jan 3, 2014, 8:21:00 PM
to stan...@googlegroups.com
Not trying to say anyone is writing dumb tests!  I think maybe we've taken the "path of least resistance" in writing tests a few too many times (myself certainly included) and things like this are the result: we duplicate effort and the difficulty of writing tests remains the same.

Regarding the try/catch stuff in that pull request, this is one place where I think it's reasonable to draw a line in the sand and not test everything we can possibly think of.  In my mind this is a place where some good comments, explaining why the code is written that way, would go a long way to ensuring that it doesn't break.  Add to that a regression test which triggers the catch and, to me, that's sufficient.
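
Something on this scale is what I mean; the function and exception here are
hypothetical stand-ins for whatever the pull request wraps in try/catch:

#include <gtest/gtest.h>
#include <stdexcept>

// Hypothetical stand-in for the wrapped code path.
double checked_transform(double x) {
  if (x < 0)
    throw std::domain_error("x must be non-negative");
  return x;
}

TEST(Regression, transformRejectsNegativeInput) {
  // One small test that actually exercises the error path the catch handles.
  EXPECT_THROW(checked_transform(-1.0), std::domain_error);
}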

Cheers,
Marcus



