Stability of our code base, new vs. existing features


Daniel Lee

Nov 29, 2013, 00:53:52
To stan...@googlegroups.com
Happy thanksgiving, all!

I just wanted to address Michael's last response from his most recent email. What I see are differences in opinion about the standards of the code base. I suggest we come to an understanding, write it down clearly, and move on. (Mike, I realize that I'm singling out this last paragraph, but I think it's really clarified the differences in my head.)

***

Bob and I are treating Stan as a real project, where we have some responsibility to our users. It is no longer academic code used for a proof of concept. That's why we, as a group, put in processes. If you recall, our code base wasn't always stable. It didn't matter when the only users were the developers. Now, instability in our code base results in a snowball effect.

We do encourage *new* features that don't create problems. By problems I mean:
- code errors like seg faults, infinite loops, incorrect gradient calculations, etc.
- statistical errors like bad sampling
- poor documentation leading to lots of user confusion
- features we don't have the ability to support
The reason we want to keep the number of features low is so that we can support all of them.


To the best of our abilities, *existing* features should get better, not worse. Once we've committed to a feature, we have this responsibility. As we continue to move forward, we should be fixing problems and making these features better.

***

With that said, I want to comment on Michael's response.

On Thu, Nov 28, 2013 at 7:34 PM, Michael Betancourt <betan...@gmail.com> wrote:
> Or maybe we need to test on some other models?  As Michael said, these
> models are all relatively easy to fit.

These tests were only to validate that the new code wasn't messing anything
up, not to motivate its existence.  If every algorithmic change gets micromanaged
like this we're going to very quickly end up with stagnant samplers.

This paragraph is loaded, so I'll just toss things in a list:
  • "new code" -- the changes were to the existing, default behavior, not a new feature.
  • "These tests were only to validate that the new code wasn't messing anything up" -- I don't believe the tests validated that the changes weren't messing anything up. The changes didn't dominate in the (mean - expected) / MSCE plots and the changes was slower. If it dominated the (mean - expected) / MCSE, I would have been convinced that it improves the code base. If that were the same, but the changes made it faster, I would have been convinced that it improves the code base.
  • "... not to motivate its existence" -- if you have concrete examples that motivate its existence, can you share them? If those examples clearly show that these changes improve the existing default behavior, then we can discuss whether that out-weights the speed issue.
  • "micromanaged" -- I get it, you're frustrated. I really don't want you to be, but understand that when shit hits the fan, both Bob and I feel responsible for our users. We stop what we're doing to get it fixed, even if we didn't create the problem. That loss of productivity frustrates me more than process because it really kills what I want to do. I don't want to clean up after members of our team. I'd rather things get done right the first time.
  • "If every algorithmic change" -- this is not being singled out. Since we put the process in, we've been looking at every change to our code base. We've checked every change to the best of our abilities. This change is no different.
  • "... gets micromanaged like this..." -- this branch changes the behavior of the default sampler and even when there isn't a default. That's why we're looking at it so closely. If this change doubles the runtime of everyone's models, that's not good unless it fixes something wrong with the way the current sampler works. If this didn't change the defaults, but allowed the changes to be run by changing settings on the command line, I would have let it into our code base a long time ago. This change is being looked at closely because these changes are forced on all users. We should be really confident that this is an improvement before changing the behavior of Stan.
  • "... end up with stagnant samplers" -- if this were an additional sampler, separate from our existing samplers, it would have passed already. The acid tests would have convinced me that we're not going to have seg faults, infinite loops, or real sampling issues. However, given that you're trying to change what everyone is using, the acid tests don't clearly show that it improves anything except a couple outliers. It does, however, show a setback in speed.
  • "stagnant samplers" -- personally, I'd rather have stagnant samplers over a stagnant project. If we start having to fight fires all the times due to instability in our code base, the project will die. Of course, we can't have it go the other way.
  • Suggestions: 1) show how the changes improve the defaults and why the defaults should be changed or 2) make this a new sampler (not the default)
  • I'm not trying to frustrate you. Neither is Bob. And I'm sure you're not trying to frustrate me. But you've got to understand, since you're trying to change existing behavior, the burden is on you to show how something is better. The same has held true for a lot of the pull requests we've gotten in. Especially for our bug fixes. So we're all playing by the same rules.
  • In retrospect, the arg refactor branch shouldn't have gone in as-is. Michael, you put a lot of time into it. I was deep in it to the point where the code all made sense. I was really frustrated that we couldn't improve on Stan v1.3.0, so I pushed for it without fully evaluating the performance. That's on me. Unfortunately, we (Stan) can't really do anything about it now except to prevent something like that from happening again. Anyway, all I wanted to say is don't expect large changes to get in without evaluation. That's for everyone (myself included). 
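
For concreteness, here is the minimal sketch of the (mean - expected) / MCSE check mentioned in the list above. It assumes `draws` is a 1-D array of post-warmup draws for one parameter and `expected` is the value the acid test treats as truth; both names are illustrative, not anything in our actual test harness, and the MCSE estimate is deliberately crude.

import numpy as np

def mcse_z_score(draws, expected):
    """Return (sample mean - expected) / MCSE for one parameter."""
    draws = np.asarray(draws, dtype=float)
    n = draws.size
    mean = draws.mean()
    # Crude MCSE: sd / sqrt(ESS), with ESS approximated as
    # n / (1 + 2 * sum of positive lag autocorrelations),
    # truncated at the first negative autocorrelation.
    centered = draws - mean
    acov = np.correlate(centered, centered, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0
    for k in range(1, n):
        if rho[k] < 0:
            break
        tau += 2.0 * rho[k]
    mcse = draws.std(ddof=1) / np.sqrt(n / tau)
    return (mean - expected) / mcse

If the sampler is unbiased and the MCSE estimate is honest, these z-scores should look roughly standard normal across parameters and models.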

***

What to do going forward? I think we should agree on how we treat our code base. Perhaps putting in process without properly motivating it was wrong. I assumed (incorrectly) that the motivation for the processes was clear.




Daniel



On Thu, Nov 28, 2013 at 7:34 PM, Michael Betancourt <betan...@gmail.com> wrote:
> Are the  MCSE plots absolute or normalized by expected parameter
> size?  I ask because I'm surprised there isn't more range given the wide
> range of param values.  Or are all these models standardized in some way?

Absolute.  The parameters plotted are those with expected values
in the tests, so any standardization is incidental.

> What is the plot labeled "Variances"?

Those are the marginal variances used for the mass matrix elements.
The fact that they're the same indicates that the adaptation parameters
are not changing.

> If we don't trust the expected values, how am I supposed to
> interpret (mean - expected)/MCSE?  I see no reason why the
> flex adaptation wouldn't be doing better because of the lower
> step size.  Can I trust the MCSE values?

Yes, there are no evident pathologies to suggest otherwise.

> Good cases for
> comparison are oxford and salm/salm2, where MCSE looks better for the flex
> adaptation but (mean - expected)/MCSE looks better for develop.

There's really no statistical difference in those distributions; they're
both equally consistent with N(0, 1).  In fact, that's true for just about
all of these.  Trying to make any judgement by eye is a terrible idea;
these plots are just about identifying gross features.

> It seems that the speed issues are directly related to the
> lower step sizes and increased tree depth.  This is why I'd
> like to see flexible adaptation tested with the old target acceptance
> rate.  Or the develop branch tested with a higher target acceptance
> rate.  Can't that be done by just changing the config for acceptance
> rates on develop and flexible adaptation branches?

The speed difference is really due to subtleties in NUTS.  As the step size
drops the ESS gets better, but not enough to compensate for the super linear
cost of building the NUTS tree.

> Or maybe we need to test on some other models?  As Michael said, these
> models are all relatively easy to fit.

These tests were only to validate that the new code wasn't messing anything
up, not to motivate its existence.  If every algorithmic change gets micromanaged
like this we're going to very quickly end up with stagnant samplers.


Michael Betancourt

Nov 29, 2013, 04:29:07
To stan...@googlegroups.com
I think we all agree on more stable code, but that presumes a few things.

Firstly, algorithmic updates are fundamentally different from software updates.
HMC is still a relatively young algorithm, and as we learn more about it
theoretically we learn more about how to implement it best in practice. The
difficulty is that we defined "stable" arbitrarily -- updates to HMC aren't
so much new features as improvements in the implementation.

Secondly, our process for updating the algorithms has to have some
level of fluidity. Here are various ways we could proceed,

1 - Acceptance by authority -- theory says we do it, we do it no questions.

2 - Acceptance by blind process -- any update has to be extensively validated.

I understand why (1) isn't desirable, but (2) alone isn't feasible either. Without
a perfect validation scheme we have to take the theoretical motivation into
account in the process.

Consider these BUGS acid tests, which are not at all representative of
the hard models we're seeing users struggle with on the list. The immediate
reaction has been to nit pick every plot and choose a winner by "majority".
Firstly, the nit picking is statistically unfounded; secondly, any such vote
presumes that the BUGS models are representative.
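
For what it's worth, the "statistically unfounded" point can be made concrete: rather than eyeballing the (mean - expected) / MCSE plots, one can formally test whether the z-scores from each branch are consistent with N(0, 1). A minimal sketch, assuming scipy is available; z_develop and z_flex are hypothetical arrays of per-parameter z-scores from the develop and flexible_adaptation runs, not actual results.

import numpy as np
from scipy import stats

def consistent_with_std_normal(z, alpha=0.05):
    """One-sample Kolmogorov-Smirnov test of the z-scores against N(0, 1)."""
    stat, p_value = stats.kstest(np.asarray(z, dtype=float), "norm")
    return p_value > alpha, stat, p_value

# If both branches pass, e.g.
#   consistent_with_std_normal(z_develop)
#   consistent_with_std_normal(z_flex)
# then the plots alone can't pick a winner.  The z-scores are not independent
# across parameters, so treat this as a rough check, not a definitive test.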

The immediate reaction to this is to cherry pick comparisons that put
updates into the best light possible instead of understanding the true
context of an update. Yes, I can find two or three models that show significant
improvement but I think we're all mature enough to see the true context
of an update and make a more sophisticated decision.

In particular, increasing the acceptance probability makes Stan far more
robust to high-dimensional hierarchical models (see the hmc_for_hier draft),
and I'll trade some speed on easy models for more robust sampling overall
any day. This is a trade-off that no single comparison will decide but rather
one we have to make. But it's difficult to have that conversation when
people focus too hard on the latest tests instead of the entire context
(in all seriousness, have people forgotten the tests in the hmc_for_hier
paper or were they not clear enough for the relevance to be obvious?).

So how does the process proceed in this case? Do I cherry pick the tests
to prove my point, or do we all have the time to understand the context
before making any judgements?

Daniel Lee

Nov 29, 2013, 08:38:04
To stan...@googlegroups.com
I think you missed the point of my original message.

There's a difference in how we treat existing behavior from new behavior. We have a responsibility to our users to keep existing behavior stable.

Since you're asking for existing behavior to change, and not just any part of the existing behavior, but really how the defaults work, all I'm asking for is justification. The acid tests obviously don't provide it and only show that 1) the new stuff "looks" about equally bad and 2) is slower.

*** So all you need to provide is where this new stuff works better and why we should accept it as our defaults. *** (with that said, don't cherry pick the BUGS examples, because all I'll do is cherry pick the bad ones to show it's not good... doesn't really help to go that route)

You've already shown that the statistical properties of the changed stuff are about as bad as our current branch's and that the code doesn't die.

If you can't provide justification, make it a new sampler. Don't change existing behavior. When we can justify changing the behavior of the defaults, we'll adopt it.



Daniel

Daniel Lee

Nov 29, 2013, 09:17:04
To stan...@googlegroups.com

Hopefully, RMHMC will improve on all the statistical properties. If we want to change the defaults, we are going to have the same discussion with the same level of scrutiny.

Damien

Michael Betancourt

Nov 29, 2013, 09:44:05
To stan...@googlegroups.com
No, I got your point but I maintain that it's not that simple.

We have apparently wedded ourselves to the existing behavior of an algorithm that is still being studied.
That means we do not have a proper criterion for determining a valid update; in particular we have
no definition of the ensemble of models that best represent our user community.  The BUGS models,
for example, certainly don't provide a representative sample.

Now consider the flexible adaptation branch (really just the increase of the target acceptance probability
as the default adaptation parameters don't yield significantly different behavior).  The easier BUGS 
models get a bit slower, but we become much more robust to hierarchical models.  I continue to point
to the studies done in the hmc_for_hier draft (why does everyone keep ignoring this?), with the BUGS 
acid test showing that there are no pathological side effects.  But because the improvement is not universal --
the update is a compromise -- there is no criterion for determining whether to use it or not.

My point about cherry picking is that without a proper criterion a submitter could present biased
examples that don't show various pathologies.  This could be explicitly dishonest or unintended,
but the danger to the code is the same.

So then how does one provide justification?  If I present the acid tests to show reasonable performance
on a class of easy models, then people rush to point out small, irrelevant differences.  If I present a few
harder models in isolation, that also seems to be unsatisfactory.  If I show both, there's no consensus
on how to weight them against each other.

Michael Betancourt

Nov 29, 2013, 09:45:35
To stan...@googlegroups.com
Hopefully, RMHMC will improve on all the statistical properties. If we want to change the defaults, we are going to have the same discussion with the same level of scrutiny.

Damien


I'm hoping this is a typo and not an identity change.

RHMC is going to run into the same issues -- its utility will depend on the ensemble of models on which it is compared.
Without a prior definition it's going to come down to one judgment or another.

Andrew Gelman

Nov 29, 2013, 10:23:56
To stan...@googlegroups.com
Let me just add three things:

1.  I appreciate everyone's openness in discussing these issues.

2.  Just as we have different algorithms, we could have different adaptations.  Mike and Matt seem to know what they're talking about regarding algorithms and adaptations, so I'm inclined to trust their opinion on what they should be doing.  On the other hand, if the proposed new adaptation rules are much worse than the old rules for most of the Bugs examples, that's relevant information.  If, as Mike says, the Bugs models are easy cases (so it's no big deal to be much slower on them), but there are new hard models where the new adaptation works better, then I think it would make sense to add these to our corpus.

3.  Also, if slowness is driven by a few slow chains that are in turn driven by poor adaptation arising from poor starting points, I'd like to know this.  Perhaps I should sketch out the analysis I'm suggesting (it's a hierarchical model which we can dogfood in Stan, naturally), and this could serve as a template for future comparisons.

A

Michael Betancourt

Nov 29, 2013, 10:42:06
To stan...@googlegroups.com
(2) The big issue here is that we don't have a corpus!  That makes it hard to compare, even if we had a paired comparison approach as you mention.
My biggest frustration is that even if we do have a corpus we can't interpret it as "all better or no go" since most updates will require compromises
between speed and robustness, and any criterion to best choose that compromise requires a proper representation of user models.

So, for example, a higher acceptance probability is better if most users are running high-dimensional hierarchical models but worse if everyone
is running BUGS models.

(3) The slowness isn't a feature of poor chains but straight up slower code.  The overhead in NUTS makes its cost super linear with the inverse step size.
Something like 1/epsilon + log_2(1/epsilon), although the coefficients would be important and may depend on the specific model.
Note, for example, that not all of the models get slower.  This significantly complicates the theoretical arguments.
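
To make the trade-off concrete, the comparison that matters is roughly gradient evaluations per effective draw. A minimal sketch, with purely hypothetical numbers rather than measurements from the acid tests:

def cost_per_effective_sample(total_leapfrog_steps, ess):
    """Gradient evaluations spent per effective draw."""
    return total_leapfrog_steps / ess

# Hypothetical develop vs. flexible_adaptation runs on one model:
#   cost_per_effective_sample(200000, ess=400)  -> 500.0
#   cost_per_effective_sample(520000, ess=650)  -> 800.0
# The second run has the better ESS but the worse cost per effective draw,
# which is the pattern being described: a smaller step size and deeper trees
# give better mixing per iteration, yet more work per effective sample.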

Ben Goodrich

Nov 29, 2013, 10:56:44
To stan...@googlegroups.com
Some comments, mostly not particular to the flexible_adaptation branch.

  • I think the approach Stan has taken to choosing defaults is wrong. It shouldn't be what settings work best for most people most of the time. The defaults should be chosen to avoid disastrous behavior as much as possible. A small percentage of users can tweak the settings to make it go faster for particular special cases.
  • That said, many projects have a policy of waiting at least one release after a new possibility has been added before enabling it by default, unless there is a consensus that the existing default behavior is wrong.
  • That said, I think Daniel and Bob have been taking a position that is characteristic of commercial software that is not updated very frequently and cannot be allowed to break the workflow of existing customers. And I don't think Stan has reached that point yet. We don't have many paying customers, and it isn't that bad if a model that used to work stops working for a release. For reproducibility of academic work, researchers need to get in the habit of reporting the seed and the version number.
  • The BUGS examples are useless for making decisions. They illustrate how to do stuff in the Stan language to people who are already familiar with BUGS and provide a bunch of compilation tests. But, we do not know what the right answers are, they mostly have very little data and very little structure, they mostly take very little time to run (especially relative to the compile time), and they are mostly applicable to particular subfields that are not necessarily the subfields that most of us or most of Stan's users are interested in. In order to argue that one algorithm is better than another, you have to generate the data yourself, apply both algorithms to it, and integrate the comparison over the data generating processes.
  • Stan still needs more users and more developers. But I don't think there are many instances where we have convinced someone to use Stan over BUGS because the effective sample size per time is larger with Stan. Much more often, they have a model that is infeasible in BUGS that turns out to be feasible with Stan. Or conversely, they can't switch even though they want to because they heavily use some distribution / parameterization that is available for some variant of BUGS but is not in Stan yet.
  • At this point, I think we have more responsibilities to current grant funders than to users and need to position Stan to attract new grant funders. If so, then we need to lower the obstacles to adding new features.

Ben


Andrew Gelman

Nov 29, 2013, 11:02:01
To stan...@googlegroups.com
We do have a corpus, right?  At least, the Bugs models play the role of a corpus for us, and I think Daniel has set up some scripts that will run all the Bugs models automatically. So we could add some high-dimensional hierarchical models to the corpus.  In any case, I'm not saying that our rule should be "all better or no go," just that it is worth knowing if the overwhelming number of Bugs models are slower under the new version.

In any case, let me add a point 4 which I forgot to emphasize earlier:

4.  I do think that being a methods development tool is one of the important roles of Stan.  For one thing, the role as development tool is intimately connected to the role of applied research tool.  If I want to fit big hierarchical models, I'm gonna need all the new methods being developed by the Matts and Mikes and the Girolamis of the world--so I want them working in Stan!  It's worth a bit of sacrifice to users to be on the bleeding edge, I think.

Andrew Gelman

Nov 29, 2013, 11:08:26
To stan...@googlegroups.com
On Nov 29, 2013, at 4:56 PM, Ben Goodrich wrote:
  • I think the approach Stan has taken to choosing defaults is wrong. It shouldn't be what settings work best for most people most of the time. The defaults should be chosen to avoid disastrous behavior as much as possible. A small percentage of users can tweak the settings to make it go faster for particular special cases.
Good point, especially if the latter point is clear in the documentation.  At the same time, _if_ an alternative setting works better in most of the Bugs examples, and we happen to know this, we might as well let people know it.

  • Stan still needs more users and more developers. But I don't think there are many instances where we have convinced someone to use Stan over BUGS because the effective sample size per time is larger with Stan. Much more often, they have a model that is infeasible in BUGS that turns out to be feasible with Stan. Or conversely, they can't switch even though they want to because they heavily use some distribution / parameterization that is available for some variant of BUGS but is not in Stan yet.
Interesting point.  Certainly one of my motivations for Stan was that I couldn't fit hierarchical matrix models in Bugs at all.  And, perhaps even more relevantly, Bugs was a black box.  I never had any idea what Bugs was doing, or what was happening when it wasn't converging, etc.  It was important for me for Stan to be a program we could get inside and play with if there were convergence problems.

Marcus Brubaker

Nov 29, 2013, 16:16:25
To stan...@googlegroups.com
Ok, I've been hesitant to weigh in on this for a number of reasons but after Ben's post I've decided to follow suit and post some high-level comments.

After the mess that was the run-up to 2.0 we all decided to add some more process to try to better manage things.  So far, that has been working in that develop has pretty consistently been in a good state which could generally be released at any moment.  This is a good thing and we should all stop and give ourselves a little pat on the back about this.

Ben is right that Stan needs more users and more developers.  To get more users we need to add more features and make the ones that are there better.  To do that, we need more developers.  (Or more hours in a day...)  Let's be clear about this, Stan as it stands right now is a great tool and it's great that it can be useful to a lot of people but it is NOT a perfect finished product and moving to a development model which is predicated on minimizing change is a bad idea.  Stan needs more work.

The problem with the new process is that it is greatly slowing this work.  I'm not just talking about big features and changes but even small simple things.  flexible_adaptation took a month and a half for anyone to so much as comment on.  Other pull requests which provide clear improvements and are unit-tested but may not reach some nirvana-like ideal of unit testing are held in an indefinite limbo (e.g., the LDLT fixes and unit tests).  Yet others which address clearly defined issues have been left with absolutely zero feedback after a month and a half (e.g., re-adding distance functions).  It feels like the threshold for pull requests to be merged has become "is the feature perfect?" and unless the answer is a clear yes, it doesn't get merged.  That would be fine for a very mature project but Stan is not that.

We have maybe a half dozen people contributing to the main Stan project right now and the current process is bogging things down.  If simple pull requests can't be at least *commented on* in a reasonable period of time (a day or two?) something is broken and we're not going to be attracting any more developers.  And, to be blunt, we may start losing some.  For me right now Stan is not a part of any of my main projects.  I do it mostly in my spare time because I believe the project is important.  For the few side projects where I do use Stan, I've now taken to maintaining a separate branch and my motivation to keep pushing to try to get things merged is dropping with my free time.  I'm not trying to sound petulant here but this is the reality, we're all busy people and we put our efforts where they can be useful.  Unmerged, uncommented pull requests are the opposite of useful.

Anyway, hope everyone had a Happy Thanksgiving and stuffed themselves with a suitable amount of tasty food! :)

Cheers,
Marcus




Bob Carpenter

Nov 29, 2013, 18:54:09
To stan...@googlegroups.com
It looks like we all understand each other at this point
and just have different opinions about where Stan should
go.

Among the core developers

* Matt, Jiqiang, Allen, and Peter haven't weighed in,

* Daniel and I favor slower development with an emphasis
on stability and a stricter process, and

* Marcus, Ben, Michael, and Andrew favor an emphasis on
research and new features and a more relaxed process.

Both goals make sense, so there's no "right" answer here.

With such differing opinions on our goals, it's obvious
that a consensus isn't going to work any more.
Nominally, Andrew's in charge, but as far as I know, he's
never even looked at the code.

Assuming nobody else has anything to add at this point,
this leaves open the question of what Marcus,
Ben, Michael, and Andrew think the process should be. I'm
specifically thinking about new code and doc in terms of:

* designing
* prioritizing
* coding
* testing
* continuously integrating
* documenting
* reviewing
* merging
* releasing
* integrating (R, Python, etc.)

An extreme answer would be to skip overall design and
prioritization, put no requirements on testing and doc,
shut down the continuous integration server,
skip code review, give everyone push permission, forgo
official releases, and leave those working on RStan
and PyStan to fend for themselves on integration.
I'm guessing this would inevitably lead to everyone forking the
code and working on his (or her if any women ever join
the team) own version.

- Bob
>> * I think the approach Stan has taken to choosing defaults is wrong. It shouldn't be what settings work best for
>> most people most of the time. The defaults should be chosen to avoid disastrous behavior as much as possible.
>> A small percentage of users can tweak the settings to make it go faster for particular special cases.
> Good point, especially if the latter point is clear in the documentation. At the same time, _if_ an alternative
> setting works better in most of the Bugs examples, and we happen to know this, we might as well let people know it.
>
>> * Stan still needs more users and more developers. But I don't think there are many instances where we have
>> convinced someone to use Stan over BUGS because the effective sample size per time is larger with Stan. Much
>> more often, they have a model that is infeasible in BUGS that turns out to be feasible with Stan. Or
>> conversely, they can't switch even though they want to because they heavily use some distribution /
>> parameterization that is available for some variant of BUGS but is not in Stan yet.
> Interesting point. Certainly one of my motivations for Stan was because I couldn't fit hierarchical matrix models
> in Bugs at all. And, perhaps even more relevantly, Bugs was a black box. I never had any idea what Bugs was doing,
> or what was happening when it wasn't converging, etc. It was important for me for Stan to be a program we could get
> inside and play with if there was convergence problems.

Ben Goodrich

Nov 29, 2013, 20:19:47
To stan...@googlegroups.com
On Friday, November 29, 2013 6:54:09 PM UTC-5, Bob Carpenter wrote:
Assuming nobody else has anything to add at this point,
this leaves open the question of what Marcus,
Ben, Michael, and Andrew think the process should be.  I'm
specifically thinking about new code and doc in terms of:

   * designing

I think there wouldn't be an overall design for Stan, only a design for a proposed feature. Feature design is something that can be discussed, but ultimately whoever writes the code chooses its design.
 
   * prioritizing

1. Grant commitments
2. Fixing (hopefully rare) build / test failures
3. Regression bug fixes
4. Everything else
 
   * coding

I think it is okay to have style guidelines. In other words, it would be acceptable to delay merging something that does not correspond to the style guidelines no matter how good it is. Also, I think code should conform to the C++11 standard, even if we are not utilizing any new C++11 features yet (although I think it would be okay now to have tests that utilize C++11 features that run iff the C++11 flag is set).
 
   * testing

We definitely need high standards on what degree of tests need be included with a new feature. Conversely, I think any pull request that only adds to src/test should be merged as long as the author attests that test-unit passes locally. If it turns out that test-unit doesn't pass on a different machine, fix the problem or revert the merge.
 
   * continuously integrating

I continue to think we have inadequate hardware for our test suite. Running the whole test suite is overkill for many pull requests and is inadequate for a small number of big pull requests that really need more human review.

Part of it is that I said I would make it possible to do test-all on Hotfoot, and I did. But then they pushed us onto another server cluster that doesn't have an adequate toolchain, and I haven't had time to do much about it this semester. At this point, I think I should wait until clang 3.4 is released in late December before requesting that they install it and associated things on the new server.
 
   * documenting

Similarly to the testing bullet point, we need high standards for documentation of new things and low standards for documentation of existing but inadequately documented things. Any pull request that only changes src/doc should be automatically merged, and if it breaks make manual or make doxygen or whatever, then whoever broke it should fix it ASAP.

   * reviewing and merging

I think we need some standards here, but I don't have a well-thought out opinion. As I mentioned above, I think we need less review than what we do now for some things and more review for others. For changes that entail heightened scrutiny, some options include
But, here is another situation where more developers would help. Right now, we don't have enough people that know enough about various parts of Stan to provide timely and useful review (e.g. only Bob understands the parser).

   * releasing

We agreed to the gitflow workflow but haven't followed it. I think we should follow it, but maybe it could be reconsidered. We have had a few bugs that I think would rise to the standard of necessitating a hotfix release that hasn't happened.
 
   * integrating (R, Python, etc.)

The more interfaces the better, but I don't think they deserve any more or less priority than regular Stan. It is not that bad if an interface doesn't immediately support a new feature of Stan.
 
An extreme answer would be to skip overall design and
prioritization, put no requirements on testing and doc,
shut down the continuous integration server,
skip code review, give everyone push permission, forgo
official releases, and leave those working on RStan
and PyStan to fend for themselves on integration.

That would indeed be extreme and more extreme than what I am advocating.

Ben

Jiqiang Guo

Nov 29, 2013, 22:01:54
To stan...@googlegroups.com
I think one issue is that we have limited time and so we cannot use the standards of commercial software.  Here are some thoughts.

I am not a software developer, so I don't care much about whether a good process for software development and the best implementation are needed.  For example, I do not think much about gitflow; I think it is fine if the doc is just minimally sufficient (for instance, it is enough for me if Ben understands the doc in rstan); I am not interested in all the requests from users, in particular if there are other ways to get to the functionality; I don't want to care about platforms such as Windows and Cygwin on Windows; I don't want to care about RStudio users (fixing problems that exist only there); I do not care about software design as I do not know much about it.  For rstan, I think it is fine to leave some testing to the users.  Then I would save more time to work on something more important for me or for Stan.

But I do think doc is important, though it could be minimal.

--
Jiqiang 

Michael Betancourt

Nov 30, 2013, 06:49:40
To stan...@googlegroups.com
I certainly don't think we should invoke the nuclear option.

Yes, our process has been slow in the past month but we discussed this
at length at the last Stan meeting and I believe it shouldn't be an issue
provided we keep on top of the organization. Formalizing the requirements
of a pull request will help substantially. So in terms of most of the process
I myself have no issues (one possible exception being design, only
because we do not have the manpower to enforce a rigid design
at the moment).

My original point has always been that we have to differentiate between
software and statistical algorithms. As Andrew and Ben have corroborated,
the latter should be more fluid because we cannot adequately define
absolute criteria for "correctness" and so cannot put it under the extreme
scrutiny we can put on software without causing significant inefficiencies.
Not to say that algorithmic changes shouldn't be discussed -- they should --
we just can't treat them as narrowly as one would treat an update to code.

In particular, I am very much in support of a stable code base while
allowing the statistical elements (such as sampler parameters, tunings, etc)
to be more agile.

Daniel Lee

Nov 30, 2013, 08:42:30
To stan...@googlegroups.com
Bob, thanks for putting the list together. Ben, Jiqiang, and Michael, thanks for weighing in. I wanted to reply since it seems like I'm advocating slower development when I'm not.


  * designing

I'm with Ben -- we should only evaluate the design of a proposed feature, and that design is up to whoever writes the code. The only thing I would evaluate is whether the design is transparent enough for someone other than the original writer to maintain.

  * prioritizing

1. Fixing bugs in Stan that users are hitting
2. User experience with Stan
3. New features
4. Speed
(I guess grant commitments need to fit on this list somewhere....)


  * coding

I'm pragmatic. I value maintainability more than anything else. Someone else should be able to pick up that piece of code and be able to fix problems or extend functionality.


  * testing

I am NOT in favor of testing to the nines. What I want in testing: proof that code executes as intended. That serves a few purposes, including providing an example of what the original writer intended the code to do and showing that the code can execute without killing everything else. I've found the hardest part of getting people to adopt testing is going from 0 tests to 1 test. But having at least one test makes the code itself much more maintainable because it shows someone else exactly how to execute something.

At the start, I was very much more trusting of everyone's code. But as we've moved forward, it became clear that not all contributed code is created equal. So I just want the minimum level of testing. More to show that the developer is capable of writing testable code and can prove that the code works. (If you think that safety of code is guaranteed by construction, it isn't.)

  * continuously integrating

I think continuous integration is good, but as Ben mentioned, we're misusing our continuous integration server. Currently, most pull requests aren't tested by the developer creating the pull request; they just get pushed out to our server. If someone has a better suggestion, we can implement it.


  * documenting

I think we should be documenting enough to keep users from getting confused and to give them a decent user experience. I take the opposite approach to Ben's merge-first suggestion. I think if it's only a doc change, it should run through just a doc test, but it shouldn't get merged until we're sure it works.


  * reviewing

I really dislike process, but realize it's necessary. The things I look for when reviewing are: 1) the code does what is intended and 2) whether someone else can pick up the code to fix it in the future. That should be pretty quick to determine (< 5 minutes).


  * merging

I believe that a change should get merged when it's clear that it does what it intends to do, it can be maintained by someone else, and we're reasonably certain that it isn't broken.

Before we had the current process, every state of master was a crap shoot and it ground my productivity to a halt because I merged and had to find out why tests on my branch weren't passing.


  * releasing

GitFlow looks like it would be useful if we could follow it, but it's pretty intense and hard to keep on top of. I would prefer faster releases. I would want a push button release, which we don't have now.


  * integrating (R, Python, etc.)

I think we should have a set of basic tests that show what each interface can do and a user can see if features are implemented. Other than that, I don't really care as long as users are able to use it.

* software and statistical algorithms (from Michael's email... btw, I still don't understand what you're trying to differentiate)

As I've mentioned before, I think there's a fundamental difference between changes to existing features and new features. Once a feature is in, it should only get better. If that condition doesn't hold, then it breaks the user experience part of my priorities. We've already heard of people going from v2.0.1 back to v1.3.0 because of speed. For me, that's going in the wrong direction. I still have the same requirement as before: proof that code executes as intended. If it's a change to an existing feature, that also implies that I'm looking for some proof that the changed feature is an improvement. I think that bar is really low, but it's pragmatic. If whoever is changing the feature can show that it's better, awesome. It should go in. For actual bugs, this is clear. Something was broken. It's no longer broken. For something like a sampler, if the claim is that we're switching to the foo algorithm because it does bar, someone should show that foo does bar and that bar is better than what we currently have. If not, foo should be a new feature, as long as it does bar.




Some general comments: I really dislike process, testing, and managing a project like this. I want to trust everyone's code. If everyone that submitted a pull request had readable and maintainable code and demonstrated that it works, it could go in as soon as we've verified that it works.

I haven't been able to work on anything remotely interesting since v2.0.0 because I've spent most of my time working on bugs that I didn't create. Since v2.0.0, I've created 22 pull requests (excluding trivial ones). Of those, 19 have been patches to other people's code, 3 of them have been patches to my code, 0 have been features. I think Bob's in the same boat.

I want to work on new features, like everyone else. But since we can't task people with work, a lot of the bug fixes have been falling on mine and Bob's shoulders. The process has helped keep develop in a clean state. Given that people aren't stepping up to fix the bugs they've introduced, my instinct is to try my best to keep them out.

With regards to slowly getting around to pull requests -- that's another place where people haven't been helping out. Perhaps they look and don't understand the code, in which case they should say so. That should be a big red flag. Marcus has pointed out that his pull requests don't get reviewed in a timely manner. I agree that we've been slow to getting to pull requests. But we could use some help.

Here's an example of where we can't assign work: for #346, I've probably spent more time than Marcus on cleaning up the code, looking at it, taking inventory, and respecting the work Marcus has put in. After reviewing, I created an itemized checklist of what needs to happen for it to get in. It all amounts to proof that the code executes and does what is intended. Since there's been no work from Marcus on that front, it's left to me to finish, which means I work on it when I get the chance. It's an improvement and not a bug, so it's lower priority. Since we didn't write down what was expected before Marcus created the pull request, it really is on us to finish it up and get it in, and not on Marcus. Ideally, I could request Marcus or Ben to finish it, but at the end of the day, if I want to see the request get in, it's on me to fix it. (Marcus, sorry to use that one as an example... this one's just the latest to come to mind, but I could have picked any of our developers.)

By the way, if anyone creates a pull request where it's really easy to tell what's going on and that has proof that it works as intended, I want it to get in as soon as we can verify that it works on an independent machine. The review process, in my mind, should be under 5 minutes. (Others might have a different opinion, but I still want to trust all developers.) Create self-contained, clean pull requests and everything will happen quickly. If it's sprawling and it takes a long time to understand what's going on, that slows everyone down. If that gets into the code base and it breaks, it'll take a lot longer to fix than not letting it into the code base in the first place.



Daniel





Nigel/essence

Nov 30, 2013, 09:22:03
To stan...@googlegroups.com
Thank you for being so open!

I think all these issues are very typical of projects like this. Many different pulls and pushes. Technical, personal.

My interest is purely as a commercial developer who is using y'all's good work. I do not use Stan itself; I have merely taken the original NUTS Matlab code, converted it to Java, and continued to integrate it with other samplers and tune it for my own needs.

I keep an eye on this group to find out your latest research ideas, not software details. I can understand if you have difficulties letting a commercial developer see inside your project. In practice, I glean very little from this discussion forum. Presumably your research ideas are conducted much more privately.

Having previously interfaced between java and R for commercial software, I decided for ease of deployment and control to nowadays do everything in Java.

Not sure what I can add to the discussion apart from:

- for testing, use the principle of Myers' 'The Art of Software Testing'. Take cases where there is a known correct solution. Increase dimensions until it fails. Take cases which are difficult. Take cases which are expensive. A test failure is good; try to make tests which will fail. The purpose of testing is to find problems and fix them. Simple easy tests are not tests at all.

- My main tests are a high-dimensional highly correlated Gaussian, and a high-dimensional adapted Haario banana. A million samples of NUTS has difficulty (but not as much difficulty as RWM!).

- a year or so ago I did lots of tests with NUTS, DREAM, RWM and combinations on hard problems, it is very time consuming but was absolutely necessary to persuade my client (a major oil company) that this work was valuable. Trust me, it is. I ended up using all these samplers, they all have merits, and the proportion of time for each sampler depends on the situation. I am also using them for importance sampling, with the resample-move approach.

- although I have integrated NUTS with other samplers, my basic NUTS code is unchanged from the original matlab code. So I am interested in any algorithmic changes, and would rather not have to go through thousands of lines of code to find the changes. So maybe there is a value in maintaining the original Matlab code (which is hardly any lines at all).

- so Stan is more than just code, it is the technical documentation as well.

- I do wonder about scope creep, and probably that is because I do not understand it. There are existing implementations of LBFGS-B, and there is an implementation provided in existing R libraries. So, as an outsider, I wasn't sure why it was being redone in Stan. There is an up to date Fortran version, which is very mature, so why redo it? A year ago I converted that free version to Java, and am using it, I only converted because I want to deploy everything in Java. BTW, the conversion was not at all easy, and I had to do very careful tests to make sure the Fortran and Java gave exactly the same results at every stage. It would surely be madness to try to write a new LBFGS-B from scratch, rather than starting from the existing mature code?

- similarly, I don't know how much you are using existing low-level maths utilities or writing your own. Who wants another Cholesky factorisation? Particularly when you start looking at GPUs, multithreading, and other tuning. Again, I did not follow this advice myself, I wrote all my own matrix algebra routines, but that was because I have experience in that from the 80's, and because the existing Java libraries were not up to scratch.

Sorry for the intrusion....and for any misunderstanding I might have about what you actually do. I just want to support any efforts to continue your excellent work, and am intrigued what the next generation (or maintenance) of NUTS will look like. The rest of it 'I could care less' in your AmeriEnglish.

One day I will publish my validated research in an SPE (Society of Petroleum Engineers) journal. I am currently under a non-compete restriction.

Nigel/essence

Nov 30, 2013, 10:03:37
To stan...@googlegroups.com
ps. one comment which occurred to me, and as an example of testing. I also read documents explaining some rationale for implementing your own matrix algebra routines.

I implemented a mass matrix version of NUTS. One of my implementations built it using BFGS during the burn-in phase, and then fixed it.

What was interesting was that this worked very well for my highly correlated Gaussian distribution (which is not surprising, the mass matrix is constant over all space), but did worse than no mass matrix for the Haario banana test function.
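
For reference, here is a minimal sketch of the banana target I mean, in one common Haario-style parameterization; the b and sigma1 values are illustrative defaults, not my exact settings.

import numpy as np

def banana_log_density(x, b=0.03, sigma1=10.0):
    """Unnormalized log density of a d-dimensional twisted Gaussian.

    Coordinate 1 is wide (sd = sigma1), coordinate 2 is bent by b * x1^2,
    and the remaining coordinates are standard normal.
    """
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1] = x[1] + b * x[0] ** 2 - b * sigma1 ** 2   # undo the twist
    return -0.5 * (y[0] / sigma1) ** 2 - 0.5 * np.sum(y[1:] ** 2)

The local curvature changes as you move along the banana, which is presumably why a single global mass matrix that helps the fixed correlated Gaussian can hurt here.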

So any test suites should include cases where mass matrices are useful and where they are not useful, and guidance given in the documentation.

Does your experience reflect mine?

At the end, I implemented diagonal mass matrices which are similar to the suggestions in Nocedal and Wright for a starting point for BFGS.

But my point is that in just this area, testing is quite a difficult and time consuming process, even where test cases have a known analytical solution.

Bob Carpenter

Nov 30, 2013, 12:39:04
To stan...@googlegroups.com


On 11/30/13, 10:03 AM, Nigel/essence wrote:
> ps. one comment which occurred to me, and as an example of testing. I also read documents explaining some rationale for implementing your own matrix algebra routines.

I'm not clear what you mean by "your own matrix algebra routines".
We're using Eigen for all of our linear algebra. For some operations,
like multiplication in the simplest case, we needed to implement our
own version for efficiency of auto-diff because Eigen doesn't allow
mixed type multiplication (in Stan, that would be agrad::var and double).

> I implemented a mass matrix version of NUTS. One of my implementations built it using BFGS during the burn-in phase, and then fixed it.
>
> What was interesting was that this worked very well for my highly correlated Gaussian distribution (which is not surprising, the mass matrix is constant over all space), but did worse than no mass matrix for the Haario banana test function.

The problem with the banana is that there's no appropriate
global mass matrix because the covariance structure changes
depending on where you're at in the posterior.

> So any test suites should include cases where mass matrices are useful and where they are not useful, and guidance given in the documentation.

That would be a very good idea. As is, the doc is
rather vague on what command-line options to use for what.

- Bob

Nigel/essence

Nov 30, 2013, 13:00:12
To stan...@googlegroups.com
I'm not sure whether you are understanding my point....

It is counter intuitive, but tests should focus on test cases where NUTS has a hard time. Push NUTS to its limits, find cases where it is hopeless.

The objectives in testing are not the same as the objectives when publishing in journals. The former tries to make it fail, the latter tries to make it work!

At least NUTS should fail less spectacularly than alternatives.

When you have these hard test cases, you can then make recommendations and rationales for why they are difficult and what settings are needed.

Having got a test suite of difficult problems, with analytic solutions, you can then do regression testing with every new algorithm or algorithm change.

For each test, repeat 100 times, and also use different settings and starting points.
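
Something like the following loop is what I have in mind -- a sketch only; run_sampler stands in for whatever actually produces the draws (CmdStan, RStan, my Java port), and the tolerance is deliberately crude.

import numpy as np

def make_correlated_gaussian(dim, rho=0.95):
    """Known target: zero-mean Gaussian with constant pairwise correlation rho."""
    cov = rho * np.ones((dim, dim)) + (1.0 - rho) * np.eye(dim)
    return np.zeros(dim), cov

def recovers_truth(draws, true_mean, tol_z=4.0):
    """Fail if any marginal mean is more than tol_z naive MCSEs from truth.

    draws: array of shape (n_draws, dim); the naive MCSE ignores autocorrelation.
    """
    n = draws.shape[0]
    z = (draws.mean(axis=0) - true_mean) / (draws.std(axis=0, ddof=1) / np.sqrt(n))
    return bool(np.all(np.abs(z) < tol_z))

def stress_test(run_sampler, dims=(2, 8, 32, 128), n_repeats=100):
    failures = {}
    for dim in dims:
        true_mean, cov = make_correlated_gaussian(dim)
        failed = 0
        for seed in range(n_repeats):
            draws = run_sampler(true_mean, cov, seed=seed)   # hypothetical interface
            if not recovers_truth(draws, true_mean):
                failed += 1
        failures[dim] = failed
    return failures   # e.g. {2: 0, 8: 0, 32: 3, 128: 41} -- where it starts to break

The interesting output is the dimension (or correlation, or banana-ness) at which the failure count starts climbing, not whether the easy cases pass.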

It might make your workstations warm.

In the optimisation world there is a 'standard' set of difficult test problems. What we need is an equivalent in the MCMC world, together with 'known' solutions. BUGS examples appear, from comments above, not to be this test suite. Finding the 'known' solution may be difficult.

This kind of process is what my oil company client is requesting. They do want to have some confidence (sic) in their reserves uncertainty quantification. So does SEC. With NUTS, we have some chance of achieving this, but it needs to be demonstrated. Otherwise every Tom Dick and Harry continues to present 10,000 RWM samples in 40 dimensional space as if it is the truth, whereas in fact it represents a biased fraction of the total uncertainty.

I only talk about this because of the discussions you are having about algorithmic changes and the testing process.

I think overall, within reason, I would rather get the right answer than the wrong answer in half the time.

Bob Carpenter

Nov 30, 2013, 15:04:41
To stan...@googlegroups.com
We're happy to do all of this in the open. Unlike a lot of
academic projects, we're not anti-commercial applications.

We also have private discussions for research ideas that people
want to keep private until they're published and also for
consulting projects to protect clients' privacy.

I think we all understand the basic ideas of testing. Several
of us have commercial software development backgrounds.

We want to test two things. One, we want to know where NUTS and
HMC in general is going to fail and where it's going to work.
Two, we want to test for speed on a set of models that are representative
of what our users are going to do.

We're time constrained and nobody has expressed interest in maintaining
Matt's MATLAB code. The researchers among us would rather write
the ideas up in papers. But if you want to do it, feel free!

Yes -- Nocedal's reference implementation of L-BFGS is Fortran.
The reason to redo L-BFGS is to allow us to use our auto-diff lib
and operate over unconstrained parameter spaces. And to be able to
use the same modeling language as we use for everything else.
There's not another auto-diff lib out there that covers the functions
we need. Implementing derivatives by hand is time
consuming and error prone.

We're using Eigen for all the linear algebra and Boost for most
of the tricky math and RNG. Some of the mathematical ops we've
had to write ourselves because they're not in Boost. Some of the
derivatives we've also had to code by hand. Some of the matrix ops
we reimplemented for more efficient auto-diff.

- Bob

Nigel/essence

Nov 30, 2013, 17:30:44
To stan...@googlegroups.com
I'm glad that you understand the principle of testing is to make it fail!

I have found very few in the commercial software industry understand this basic principle! They go on and on about unit testing etc. and yet have the completely wrong psychological approach to testing.

I will be free to publish next year, and have a very open approach, even though all my competitors have the opposite tendency (Schlumberger, Halliburton et al). I will be happy to share my work over the coming months. I am starting to write it all up. It is more an integration of lots of different recent research rather than anything new.

I enclose an article I wrote for an SPE publication aimed at young engineers. It hints at the issues, but my next publication will be much more revealing.

articlefinal.doc

Nigel/essence

Nov 30, 2013, 17:40:46
To stan...@googlegroups.com
ps. - you already know where it will fail - multimodal distributions, as we (and Matt) have discussed before!

But graphs of the number of function/gradient evaluations required to converge vs. dimensions would be interesting, for different types of model. At least then one would be able to predict how long it will take before convergence is reached - an hour or a week or a year.

What we can be sure is that as soon as algorithms (and hardware) improve, somebody will want to use them on ever bigger and more complex models.

Allen Riddell

Dec 1, 2013, 09:15:16
To stan...@googlegroups.com
Hi,

I'll just cast a vote in favor of stability and gradually widening the
performance gap between Stan and JAGS/BUGS.

Whence my vote for stability? Python does not have JAGS or BUGS so Stan occupies
a very different position in the statistical computing ecosystem than it does in
R.

One serious suggestion: what about a long-term support release cycle like Ubuntu
that permits rapid development and experimentation in the meantime. For those
who aren't familiar with this: Ubuntu (Linux) has a very stable release every
two years that they pledge to support/fix bugs for. Between the LTS releases
they have other more experimental releases for which they do not pledge the same
kind of vigilance. This seems to work very well. It's actually a very nice
contract between developers and users. Users know that they are getting
something that might have bugs if they don't take the LTS release.

Minor point:

- Echoing Jiqiang, I don't want to support Windows on PyStan until/unless
a Windows developer joins the PyStan team. I'd like to make this position
explicit.


Best,

Allen

Nigel/essence

Dec 1, 2013, 10:31:09
To stan...@googlegroups.com
On Saturday, 30 November 2013 20:04:41 UTC, Bob Carpenter wrote:
> We're time constrained and nobody has expressed interest in maintaining
> Matt's MATLAB code. The researchers among us would rather write
> the ideas up in papers. But if you want to do it, feel free!

I have my commercial code, and also a separate simple test environment to test MCMC methods. This is all in Java, derived from Matlab.

I need to continually demonstrate to my clients the reliability of my methods, using different simple but difficult test problems, and I would normally do this using my Java test environment.

But there is no big reason why I shouldn't do this in a Matlab environment, starting again from Matt's code.

Either way, I would be happy to share code and results.

I once did a project to optimise the optimisers - I optimised the tuning parameters for genetic algorithms, including weighting and parameters of different operators, including conventional crossover and mutation, plus particle swarm and DE. A similar project could be done for MCMC. Eg. for problem X, how much time should be spent on NUTS v. DREAM v. RWM?

I guess we could continue the discussion talking about what excites those in academia and enhances their careers and gets papers published, v. what is really important for practitioners. I don't think a post doc would get very excited about spending a year doing testing of existing algorithms, although it requires somebody of post doc capability and experience.

Bob Carpenter

Dec 1, 2013, 16:55:27
To stan...@googlegroups.com


On 12/1/13, 9:15 AM, Allen Riddell wrote:
> Hi,
>
> I'll just cast a vote in favor of stability and gradually widening the
> performance gap between Stan and JAGS/BUGS.

Thanks for responding.

> Whence my vote for stability? Python does not have JAGS or BUGS so Stan occupies
> a very different position in the statistical computing ecosystem than it does in
> R.

Have you looked at Python packages like emcee, PyMC, PyMCMC, or Theano?
There may not be a BUGS/JAGS wrapper, but there are lots of alternatives,
none of which I've ever tried. They all seem to be domain specific
EMBEDDED languages, unlike Stan/BUGS/JAGS, which are domain-specific
languages, but not embedded in a particular programming language.

http://en.wikipedia.org/wiki/Domain-specific_language#Usage_patterns
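
To make the distinction concrete (a sketch, not a recommendation of any package): in an embedded DSL the model is host-language code, e.g. an emcee-style log-probability function written directly in Python, whereas in a standalone DSL like Stan the model is a program in its own language that any interface can hand off as text.

import numpy as np

# Embedded: the model *is* Python code (here, a toy normal-mean posterior).
def log_prob(theta, y):
    mu = theta[0]
    return -0.5 * np.sum((y - mu) ** 2) - 0.5 * mu ** 2
# sampler = emcee.EnsembleSampler(nwalkers, 1, log_prob, args=(y,))

# Standalone: the model is Stan-language text, independent of Python.
stan_program = """
data { int<lower=0> N; vector[N] y; }
parameters { real mu; }
model {
  mu ~ normal(0, 1);
  y ~ normal(mu, 1);
}
"""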

> One serious suggestion: what about a long-term support release cycle like Ubuntu
> that permits rapid development and experimentation in the meantime. For those
> who aren't familiar with this: Ubuntu (Linux) has a very stable release every
> two years that they pledge to support/fix bugs for. Between the LTS releases
> they have other more experimental releases for which they do not pledge the same
> kind of vigilance. This seems to work very well. It's actually a very nice
> contract between developers and users. Users know that they are getting
> something that might have bugs if they don't take the LTS release.

I'd like to hear more about how this works in terms of moving
toward the next stable release. Does everyone just fork in
the meantime and come back together? If not, how else are
the changes in the intermediate releases kept compatible with
each other?

If we only had a single stable release, I wouldn't be so
worried. As is, we've tangled up bug fixes and new features.
Ben may be able to help us on Git process here, which Daniel
and I keep seeming to mess up.

I think the big problem is that nobody has time to do the kind
of support that we need. For example, I don't think Ben has
the time to be in charge of our Git process, so we rely on
asking his advice from time to time (usually only after we've
dug ourselves into a hole). Ben, please correct me if I'm wrong --- I'd
be very happy for you to take over on managing the Git side of things!

> Minor point:
>
> - Echoing Jiqiang, I don't want to support Windows on PyStan until/unless
> a Windows developer joins the PyStan team. I'd like to make this position
> explicit.

I wish we could do the same for all of Stan! Windows has caused
us no end of headaches because of its lame C++ tool chain. Ironically,
for more than a decade, the only way to run BUGS was on Windows!

I don't think we can ignore Windows for R. Or could we?

My impression is that most of the R community are Windows users. And to
make matters worse, the RStudio wrapper around R is gaining popularity, which
adds another layer of config complication and indirection. We have the same
issue with Rcpp, because none of the Rcpp devs have Windows machines (or at
least none did when I unsubscribed to their list after asking a few questions).

- Bob
