Testing top-level services commands


Krzysztof Sakrejda

Aug 27, 2016, 8:17:02 AM
to stan development mailing list
For testing the top-level stan::services commands for the refactor, the most correct way to go seems to be to take snapshots of the current outputs and expect the new commands to reproduce them in tests. Is that taking it too far? I could look for existing tests on the top-level commands, but I'm not sure we have any.

The optimization command tests Daniel wrote have some known correct output for a simple model, so I guess I could do the same, but with HMC it's more involved. Is there any unstated (or stated) policy on how we want to do this? I'll have some more time for writing tests next week. I have some current ones for the HMC commands, but it's only the scaffolding for the tests---I didn't push because I wasn't sure what output to test against.

Krzysztof

Bob Carpenter

Aug 27, 2016, 9:48:18 AM
to stan...@googlegroups.com
We've had this discussion before and have been split
in our opinions. To summarize, the problem with snapshot
tests is that they don't really independently test
anything at the point they are created. The utility is
that they will catch any later regression from that behavior.
That also means any intentional changes need new snapshots.

- Bob

Krzysztof Sakrejda

Aug 27, 2016, 11:21:27 AM
to stan development mailing list
I guess for now I could do snapshot comparisons with what the commands are currently producing for the stuff that's easy to verify (text output, number of iterations, verifying the interrupt callback gets called).

Could also add some tests of simulating from basic distributions (normal and uniform?) and comparing with theoretical quantiles to a few decimal places of accuracy. That would catch things like the problem we had with 2.10 and shouldn't involve too much computation.
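
Roughly what I have in mind, as a sketch: draw from the distribution, sort, and compare empirical quantiles against the known theoretical values with a tolerance wide enough for Monte Carlo error. This uses std::random and GoogleTest directly; in the real test the draws would come from the sampler output, and the names below are just placeholders.

#include <gtest/gtest.h>

#include <algorithm>
#include <random>
#include <vector>

// empirical p-quantile of a sample (simple nearest-rank version)
double empirical_quantile(std::vector<double> draws, double p) {
  std::sort(draws.begin(), draws.end());
  std::size_t idx = static_cast<std::size_t>(p * (draws.size() - 1));
  return draws[idx];
}

TEST(statValidity, standard_normal_quantiles) {
  std::mt19937 rng(20160827);
  std::normal_distribution<double> std_normal(0.0, 1.0);
  std::vector<double> draws(100000);
  for (std::size_t i = 0; i < draws.size(); ++i)
    draws[i] = std_normal(rng);

  // theoretical quantiles of N(0, 1); tolerances allow for Monte Carlo error
  EXPECT_NEAR(0.0, empirical_quantile(draws, 0.5), 0.02);
  EXPECT_NEAR(1.96, empirical_quantile(draws, 0.975), 0.05);
  EXPECT_NEAR(-1.28, empirical_quantile(draws, 0.1), 0.03);
}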

K

Michael Betancourt

Aug 27, 2016, 11:39:48 AM
to stan...@googlegroups.com

> I guess for now I could do snapshot comparisons with what the commands are currently producing for the stuff that's easy to verify (text output, number of iterations, verifying the interrupt callback gets called).

Just don’t make anything depend on specifics, like exact numbers or text. Checking that commands
run without throwing errors, or that errors do get thrown, is all good.
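
Something as simple as this is fine (a sketch with GoogleTest; run_command is just a stand-in stub, not the actual services signature):

#include <gtest/gtest.h>

#include <stdexcept>

// stand-in for a services command: returns 0 on success, throws on bad config
int run_command(int num_iterations) {
  if (num_iterations < 1)
    throw std::invalid_argument("num_iterations must be positive");
  return 0;
}

TEST(servicesCommand, runs_cleanly_with_valid_config) {
  EXPECT_NO_THROW(run_command(10));
  EXPECT_EQ(0, run_command(10));
}

TEST(servicesCommand, throws_on_invalid_config) {
  EXPECT_THROW(run_command(0), std::invalid_argument);
}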


> Could also add some tests of simulating from basic distributions (normal and uniform?) and comparing with theoretical quantiles to a few decimal places of accuracy. That would catch things like the problem we had with 2.10 and shouldn't involve too much computation.

A statistical validation framework was written long ago: https://github.com/stan-dev/stan/tree/feature/stat_valid_test.
If you want to incorporate it into this refactor then go for it.

Krzysztof Sakrejda

Aug 27, 2016, 5:07:01 PM
to stan development mailing list
Daniel's test of the optim service command depended on some exact output so that's what I went with. Does this need to get addressed at a meeting?

Bob Carpenter

Aug 27, 2016, 5:19:18 PM
to stan...@googlegroups.com
No, it just needs to be done better in the future.
I think everyone's on the same page.

- Bob

> On Aug 27, 2016, at 11:07 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
>
> Daniel's test of the optim service command depended on some exact output so that's what I went with. Does this need to get addressed at a meeting?
>

Daniel Lee

Aug 28, 2016, 9:42:05 AM
to stan-dev mailing list
+1. No, it doesn't and shouldn't be done the way I did it, but... the reason I did it that way was because there aren't tests all the way down, so the only way I knew how to prevent behavior changes in an automated test framework was to capture everything and make sure everything looks the same after the change. (It allowed me to move faster than anything else I knew how to do.)

Ideally, we would have tests at the lower levels. At this level, we want to make sure the model gets called, the writers get called, etc. but not necessarily with any particular output. We should be able to mock them and then check that the mocks get called.
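
For example (sketch only, using GoogleMock; the writer interface and run_sampler below are simplified stand-ins, not the actual callback types):

#include <gmock/gmock.h>
#include <gtest/gtest.h>

#include <vector>

// simplified stand-in for the callback writer interface
struct writer {
  virtual ~writer() {}
  virtual void operator()(const std::vector<double>& values) = 0;
};

struct mock_writer : public writer {
  MOCK_METHOD1(write, void(const std::vector<double>&));
  void operator()(const std::vector<double>& values) { write(values); }
};

// stand-in for a services command that writes one row per iteration
void run_sampler(writer& sample_writer, int num_iterations) {
  for (int i = 0; i < num_iterations; ++i)
    sample_writer(std::vector<double>(3, 0.0));
}

TEST(servicesUnit, sample_writer_called_once_per_iteration) {
  mock_writer sample_writer;
  EXPECT_CALL(sample_writer, write(::testing::_)).Times(10);
  run_sampler(sample_writer, 10);
}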

With that said, we should have another set of tests for regression testing. If this is truly the API, then we should have some sort of regression testing on the API in addition to unit tests.


That wasn't very constructive, so I'll try to add something that we can do:
- add unit tests. Just check that the model gets called, the writers get called, inputs are respected, etc. The tests should live here: src/test/unit/services/**
- add regression tests. We should check that behavior is the same, so same output, etc. So the snapshots. Add a new folder for these: src/test/regression/services/**. One thing to consider is that we should have an easy way to serialize the output for testing. That'll make updating these easier. I still think we should be aware of when the behavior changes, but dealing with it shouldn't be too hard.
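
For the serialized form, something like this is what I mean (a sketch; the snapshot path and the serialized string are hypothetical, and regenerating a snapshot would just mean rewriting that file):

#include <gtest/gtest.h>

#include <fstream>
#include <sstream>
#include <string>

// read an entire snapshot file into a string
std::string read_file(const std::string& path) {
  std::ifstream in(path.c_str());
  std::stringstream ss;
  ss << in.rdbuf();
  return ss.str();
}

TEST(servicesRegression, optimize_output_matches_snapshot) {
  // in a real test this would be the serialized output of the command
  std::string actual = "lp__ = -3.2\ntheta = 0.25\n";
  std::string expected
      = read_file("src/test/regression/services/optimize_snapshot.txt");
  EXPECT_EQ(expected, actual);
}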

Thoughts? I think we're all on the same page here. Michael's tests should go somewhere, but they're also bigger than unit tests and not quite regression tests.


Daniel


On Sat, Aug 27, 2016 at 5:18 PM, Bob Carpenter <ca...@alias-i.com> wrote:
No, it just needs to be done better in the future.
I think everyone's on the same page.

- Bob

> On Aug 27, 2016, at 11:07 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
>
> Daniel's test of the optim service command depended on some exact output so that's what I went with.  Does this need to get addressed at a meeting?
>


Bob Carpenter

Aug 28, 2016, 9:52:18 AM
to stan...@googlegroups.com
I don't think every character we output should be considered
part of the API. That's what I object to here. If we
test exact values, we screw up with different compilers.
If we test spacing, etc., then minor typo changes have massive
repercussions on testing (this has bitten me more than once
in the past).

I think what we want to do is make the output testing more
unit-like as I've tried to do with the parser. That is,
don't test the entire stream of warning messages from the parser,
just make sure that a unit of that output (like a warning message)
shows up. That can be tested for exact text, because it is more
modular and changes have local tests.
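
Something like this is the pattern I mean (just a sketch; the message text and the function that produces it are made up): capture the messages in a stream and look for the one unit of output.

#include <gtest/gtest.h>

#include <ostream>
#include <sstream>
#include <string>

// stand-in for a parse / services call that streams info and warning messages
void run_and_warn(std::ostream& msgs) {
  msgs << "Info: starting run" << std::endl;
  msgs << "Warning: variable foo is never used" << std::endl;
}

TEST(outputUnit, unused_variable_warning_shows_up) {
  std::stringstream msgs;
  run_and_warn(msgs);
  // check only that the one warning appears, not the whole message stream
  EXPECT_NE(std::string::npos,
            msgs.str().find("Warning: variable foo is never used"));
}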

- Bob



Daniel Lee

Aug 28, 2016, 10:20:48 AM
to stan-dev mailing list
On Sun, Aug 28, 2016 at 9:51 AM, Bob Carpenter <ca...@alias-i.com> wrote:
I don't think every character we output should be considered
part of the API.  That's what I object to here.  If we
test exact values, we screw up with different compilers.
If we test spacing, etc., then minor typo changes have massive
repercussions on testing (this has bitten me more than once
in the past).

+1

The one problem we have (and this isn't a hypothetical problem) is that some of the interfaces parse the text output line by line (or character by character) and rely on it. We found out that StataStan does this, and so does RStan for some of its input parsing.

 

I think what we want to do is make the output testing more
unit-like as I've tried to do with the parser.  That is,
don't test the entire stream of warning messages from the parser,
just make sure that a unit of that output (like a warning message)
shows up.  That can be tested for exact text, because it is more
modular and changes have local tests.

Yes. These are key and we don't have them everywhere, but we should. (Not just for the parser, but for other places where we throw errors.)

 

- Bob




Bob Carpenter

Aug 28, 2016, 10:28:19 AM
to stan...@googlegroups.com

> On Aug 28, 2016, at 4:20 PM, Daniel Lee <bea...@alum.mit.edu> wrote:
>
> On Sun, Aug 28, 2016 at 9:51 AM, Bob Carpenter <ca...@alias-i.com> wrote:
> I don't think every character we output should be considered
> part of the API. That's what I object to here. If we
> test exact values, we screw up with different compilers.
> If we test spacing, etc., then minor typo changes have massive
> repercussions on testing (this has bitten me more than once
> in the past).
>
> +1
>
> The one problem we have (and this isn't a hypothetical problem) is that some of the interfaces are parsing through the output by character / lines and using that. We found out that StataStan does this and so does RStan for some of its parsing of input.

Can't we just let the upstream interfaces fend for themselves?

What output's being parsed character by character by the interfaces?
I thought RStan worked by callback now and didn't take the text output
from the CmdStan routines.

I'm less concerned about StataStan and other command wrappers. Rather
than grabbing out text output, they should be taking the CSV output
and using a CSV parser, and for other output, just dumping it through
as-is. At least I think that's what they should be doing, but I haven't
thought it through deeply. We really need to make the output of
CmdStan less monolithic, such as putting the config output in standard
formats like CSV or JSON (so they can get config, get adaptation, and
get results).

OK, maybe we do need to discuss this at the meeting. All of these
dependencies on unpublished and undocumented parts of the API are
worrying. They basically lock us into the current broken way of doing things.

- Bob

Krzysztof Sakrejda

Aug 28, 2016, 2:19:03 PM
to stan development mailing list
On Sunday, August 28, 2016 at 9:42:05 AM UTC-4, Daniel Lee wrote:
> +1. No, it doesn't and shouldn't be done the way I did it, but... the reason I did it that way was because there aren't tests all the way down, so the only way I knew how to prevent behavior changes in an automated test framework was to capture everything and make sure everything looks the same after the change. (It allowed me to move faster than anything else I knew how to do.)
>
>
> Ideally, we would have tests at the lower levels. At this level, we want to make sure the model gets called, the writers get called, etc. but not necessarily with any particular output. We should be able to mock them and then check that the mocks get called.
>
>
> With that said, we should have another set of tests for regression testing. If this is truly the API, then we should have some sort of regression testing on the API in addition to unit tests.
>
>
>
>
> That wasn't very constructive, so I'll try to add something that we can do:
> - add unit tests. Just check that the model gets called, the writers get called, inputs are respected, etc. The tests should live here: src/test/unit/services/**
> - add regression tests. We should check that behavior is the same, so same output, etc. So the snapshots. Add a new folder for these: src/test/regression/services/**. One thing to consider is that we should have an easy way to serialize the output for testing. That'll make updating these easier. I still think we should be aware of when the behavior changes, but dealing with it shouldn't be too hard.

Thanks Daniel, the overview and more specific instructions are very helpful. I'll modify what I have for unit tests along these lines. I'll see if I can make sense of where we have and don't have tests deeper down. It would really help to have this kind of stuff in a wiki.

> Thoughts? I think we're all on the same page here. Michael's tests should go somewhere, but they're also bigger than unit tests and not quite regression tests.

I agree with the scheme you have set out.  Michael's tests could just go in src/test/statistical_validity/* (if we're OK with a verbose name), and I think they require a directory for output as well.  I took a look at the stat_valid_test branch and it doesn't look like it would be too much work to bring it up to date with this branch.  That seems like it should be its own pull request, though.  If this branch can go in with unit/regression tests it'll be a win all around.

Krzysztof


Daniel Lee

Aug 28, 2016, 5:05:16 PM
to stan-dev mailing list
On Sun, Aug 28, 2016 at 2:19 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
On Sunday, August 28, 2016 at 9:42:05 AM UTC-4, Daniel Lee wrote:
> That wasn't very constructive, so I'll try to add something that we can do:
> - add unit tests. Just check that the model gets called, the writers get called, inputs are respected, etc. The tests should live here: src/test/unit/services/**
> - add regression tests. We should check that behavior is the same, so same output, etc. So the snapshots. Add a new folder for these: src/test/regression/services/**. One thing to consider is that we should have an easy way to serialize the output for testing. That'll make updating these easier. I still think we should be aware of when the behavior changes, but dealing with it shouldn't be too hard.

Thanks Daniel, the overview and more specific instructions are very helpful.  I'll modify what I have for unit tests along these lines.  I'll see if I can make sense of where we have and don't have tests deeper down.  It would really help to have this kind of stuff in a wiki.

Feel free to pull whatever needs to be in a wiki. I have trouble getting started and scoping what should go in a document. If you have an idea of what you want in a doc, I can write it.
 

> Thoughts? I think we're all on the same page here. Michael's tests should go somewhere, but they're also bigger than unit tests and not quite regression tests.

I agree with the scheme you have set out.  Michael's tests could just go in src/test/statistical_validity/* (if we're ok with verbose) and I think they require a directory for output as well.  I took a look at the stat_valid_test branch and it doesn't look like it would be too much work to update it to this branch.  That seems like it should be its own pull request though.  If this branch can go in with unit/regression tests it'll be a win all around.

Yes. Feel free to start with just the unit / regression tests.


Daniel Lee

Aug 28, 2016, 5:08:22 PM
to stan-dev mailing list
On Sun, Aug 28, 2016 at 10:27 AM, Bob Carpenter <ca...@alias-i.com> wrote:

> On Aug 28, 2016, at 4:20 PM, Daniel Lee <bea...@alum.mit.edu> wrote:
>
> On Sun, Aug 28, 2016 at 9:51 AM, Bob Carpenter <ca...@alias-i.com> wrote:
> I don't think every character we output should be considered
> part of the API.  That's what I object to here.  If we
> test exact values, we screw up with different compilers.
> If we test spacing, etc., then minor typo changes have massive
> repercussions on testing (this has bitten me more than once
> in the past).
>
> +1
>
> The one problem we have (and this isn't a hypothetical problem) is that some of the interfaces are parsing through the output by character / lines and using that. We found out that StataStan does this and so does RStan for some of its parsing of input.

Can't we just let the upstream interfaces fend for themselves?

Yes. If we decide that's what we're going to do.
 

What output's being parsed character by character by the interfaces?
I thought RStan worked by callback now and didn't take the text output
from the CmdStan routines.

I'm not 100% sure. I'll find out Monday when I work through RStan.



I'm less concerned about StataStan and other command wrappers.  Rather
than grabbing out text output, they should be taking the CSV output
and using a CSV parser, and for other output, just dumping it through
as-is.  At least I think that's what they should be doing, but I haven't
thought it through deeply.  We really need to make the output of
CmdStan less monolithic, such as putting the config output in standard
formats like CSV or JSON (so they can get config, get adaptation, and
get results).


Agreed.
 

OK, maybe we do need to discuss this at the meeting.  All of these
dependencies on unpublished and undocumented parts of the API are
worrying.  They basically lock us into the current broken way of doing things.

Yeah. We're pretty locked in due to differences in RStan and PyStan. With this refactor, those will line up with CmdStan and things will be much easier. And I'd be happy to move more quickly on interface changes once we've gone through with these.

 

Bob Carpenter

Aug 28, 2016, 9:54:18 PM
to stan...@googlegroups.com

> On Aug 28, 2016, at 8:19 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
>
> ...
> I agree with the scheme you have set out. Michael's tests could just go in src/test/statistical_validity/* (if we're ok with verbose) and I think they require a directory for output as well. I took a look at the stat_valid_test branch and it doesn't look like it would be too much work to update it to this branch. That seems like it should be its own pull request though. If this branch can go in with unit/regression tests it'll be a win all around.

Yes, please keep the pull requests modular so that
it's easier to review them.

- Bob

Bob Carpenter

Aug 28, 2016, 9:56:18 PM
to stan...@googlegroups.com

> On Aug 28, 2016, at 11:08 PM, Daniel Lee <bea...@alum.mit.edu> wrote:
> ...
> Yeah. We're pretty locked in due to differences in RStan and PyStan. With this refactor, those will line up with CmdStan and things will be much easier. And I'd be happy moving quicker with changes to interfaces once we've gone through with these changes.

What refactor are you talking about? There are so many floating around
that I'm losing track.

- Bob
