Higher-Order Namespace Issues with Distributions


Michael Betancourt

Apr 26, 2015, 7:33:02 PM
to stan...@googlegroups.com
So I’m trying to make the negative binomial CDFs more robust
with a much better analytic gradient for the regularized incomplete
beta function, but it looks like the math library is set up so that
distributions can’t be implemented for just reverse mode.

In particular, the distributions are all defined in
src/stan/math/prim/scal/prob and apparently have to work
when substituting in any autodiff variable (rev and fwd vars).
Is this correct? If so we should be aware that making improvements
to any distribution is going to be so onerous in this framework
as to be practically untenable…

But let’s just consider
https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/prob/neg_binomial_cdf.hpp
for the moment. If we can’t just define a reverse mode version
then all of the special functions, like digamma and ibeta and
the new https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/fun/reg_inc_beta_derivs.hpp
have to be defined in src/stan/math/rev _and_ src/stan/math/fwd?
Or is the point of prim to define functions that can be autodiffed
through?

If it’s the latter then I’m still having problems with compilation.
Note that in https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/fun/reg_inc_beta_derivs.hpp
I used std::log — this allows the rev mode distribution to compile
(tests pass, etc) — but breaks the template substitution of
higher-order functions. But if I remove the std namespace then
the double version doesn’t compile. What, exactly, is the
correct namespacing pattern for prim functions?

Bob Carpenter

Apr 26, 2015, 10:05:06 PM
to stan...@googlegroups.com
see inline

> On Apr 27, 2015, at 9:32 AM, Michael Betancourt <betan...@gmail.com> wrote:
>
> So I’m trying to make the negative binomial CDFs more robust
> with a much better analytic gradient for the regularized incomplete
> beta function, but it looks like the math library is set up so that
> distributions can’t be implemented for just reverse mode.
>
> In particular, the distributions are all defined in
> src/stan/math/prim/scal/prob and apparently have to work
> when substituting in any autodiff variable (rev and fwd vars).
> Is this correct?

No. The versions in prim only have to work for int and double
values. The goal is no dependencies from prim to fwd or rev, so
as to reduce spaghetti-like dependencies. Dependencies from fwd -> prim
or rev -> prim are OK, but not from fwd -> rev or vice-versa.

Not having rev bring in all of fwd should be a huge win for
compilation if we can separate it out in the compiler. I think
you're arguing that it's not a win for clarity, but I think it
helps with that, too, once you see how things are organized.

> If so we should be aware that making improvements
> to any distribution is going to be so onerous in this framework
> as to be practically untenable…

I don't see why.

> But let’s just consider
> https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/prob/neg_binomial_cdf.hpp
> for the moment. If we can’t just define a reverse mode version
> then all of the special functions,

You can --- just put it in rev.

> like digamma

A pure double-based version of digamma can go in prim.

> and ibeta and
> the new https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/fun/reg_inc_beta_derivs.hpp
> have to be defined in src/stan/math/rev _and_ src/stan/math/fwd?

If you bring in dependencies to rev it has to go in rev and
same for fwd.

But you'll see that the bulk of the pdfs and pmfs just make
more general definitions.

> Or is the point of prim to define functions that can be autodiffed
> through?

Not at all. But that's one way they can be done. You'll see
that the bulk of the pdfs and pmfs are more general and use the
views and constructors.

> If it’s the latter then I’m still having problems with compilation.
> Note that in https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/fun/reg_inc_beta_derivs.hpp
> I used std::log — this allows the rev mode distribution to compile
> (tests pass, etc) — but breaks the template substitution of
> higher-order functions. But if I remove the std namespace then
> the double version doesn’t compile. What, exactly, is the
> correct namespacing pattern for prim functions?

You want to use the "using" pattern, not explicit imports.

I've got to run now, but hopefully Daniel can help out when it's
morning back in the U.S.

- Bob

Michael Betancourt

Apr 27, 2015, 3:50:54 AM
to stan...@googlegroups.com
Now I’m even more confused.

1) If I put a distribution in prim then the distribution tests try
to instantiate it with rev and fwd vars. In particular, it tries
to instantiate template functions in prim with those vars
so double-based versions shouldn’t work.

2) There are no distributions in rev so no apparent way to
create a first-order-only version.

Bob Carpenter

Apr 27, 2015, 9:26:06 AM
to stan...@googlegroups.com
Daniel and Rob did the refactoring, so they're best to
respond here. But see below.

> On Apr 27, 2015, at 5:50 PM, Michael Betancourt <betan...@gmail.com> wrote:
>
> Now I’m even more confused.
>
> 1) If I put a distribution in prim then the distribution tests try
> to instantiate it with rev and fwd vars. In particular, it tries
> to instantiate template functions in prim with those vars
> so double-based versions shouldn’t work.
>

If so, the tests are broken.

> 2) There are no distributions in rev so no apparent way to
> create a first-order-only version.

There aren't any there now because everything's implemented
with the general vector view, vector population and OperandsAndPartials
implementations. Then you bring in the rev and fwd implementations
of these as needed.
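
The operands-and-partials approach Bob mentions can be sketched in a greatly simplified form: the density core computes its value and analytic partials once, in plain double arithmetic, and a type-specific wrapper (not shown) turns those partials into the right autodiff result. All names below are illustrative stand-ins, not Stan's actual OperandsAndPartials API.

```cpp
#include <cassert>
#include <cmath>

// value and analytic partial, computed once with plain doubles
struct ValueAndPartial {
  double value;  // log density (dropping constants)
  double d_mu;   // analytic partial with respect to mu
};

// hypothetical double-based core for a unit-variance normal log density
inline ValueAndPartial normal_lpdf_core(double y, double mu) {
  double diff = y - mu;
  // d/dmu [ -0.5 * (y - mu)^2 ] = (y - mu)
  return {-0.5 * diff * diff, diff};
}
```

A double-only caller would use `.value` directly; a reverse-mode wrapper would push `.d_mu` onto the tape instead of autodiffing through the computation.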

But check out some of the other functions --- you don't need to
follow the existing density functions.

- Bob

Daniel Lee

Apr 27, 2015, 2:58:18 PM
to stan...@googlegroups.com
Sorry I'm late to respond. I'm just going to walk through the emails and reply inline. Not sure if there's a better way to get info across at this point.

On Mon, Apr 27, 2015 at 9:25 AM, Bob Carpenter <ca...@alias-i.com> wrote:
Daniel and Rob did the refactoring, so they're best to
respond here.  But see below.

> On Apr 27, 2015, at 5:50 PM, Michael Betancourt <betan...@gmail.com> wrote:
>
> Now I’m even more confused.
>
> 1) If I put a distribution in prim then the distribution tests try
> to instantiate it with rev and fwd vars.  In particular, it tries
> to instantiate template functions in prim with those vars
> so double-based versions shouldn’t work.
>

If so, the tests are broken.

The tests aren't broken. The distribution tests are written using a framework that we wrote. It's designed to test using double, reverse mode, and forward mode.

If it's trying to instantiate a double-only version or var-only version with something else, that indicates there may be an issue with the templating, argument-dependent lookup, or something else.

Even after reading through your emails, I don't know what you're trying to do.

 
> 2) There are no distributions in rev so no apparent way to
> create a first-order-only version.

There aren't any there now because everything's implemented
with the general vector view, vector population and OperandsAndPartials
implementations.   Then you bring in the rev and fwd implementations
of these as needed.

But check out some of the other functions --- you don't need to
follow the existing density functions.

What Bob said is correct. You can go ahead and write a specialized version just for first order if you want. It'll probably need to be included in the right order for that function to be called, but it should work.

Daniel Lee

Apr 27, 2015, 3:07:58 PM
to stan...@googlegroups.com
On Sun, Apr 26, 2015 at 10:04 PM, Bob Carpenter <ca...@alias-i.com> wrote:
see inline

> On Apr 27, 2015, at 9:32 AM, Michael Betancourt <betan...@gmail.com> wrote:
>
> So I’m trying to make the negative binomial CDFs more robust
> with a much better analytic gradient for the regularized incomplete
> beta function, but it looks like the math library is set up so that
> distributions can’t be implemented for just reverse mode.
>
> In particular, the distributions are all defined in
> src/stan/math/prim/scal/prob and apparently have to work
> when substituting in any autodiff variable (rev and fwd vars).
> Is this correct?

No.  The versions in prim only have to work for int and double
values.  The goal is no dependencies from prim to fwd or rev, so
as to reduce spaghetti-like dependencies.  Dependencies from fwd -> prim
or rev -> prim are OK, but not from fwd -> rev or vice-versa.

Not having rev bring in all of fwd should be a huge win for
compilation if we can separate it out in the compiler.  I think
you're arguing that it's not a win for clarity, but I think it
helps with that, too, once you see how things are organized.


Bob's response is correct. You're using the testing framework, which wants to instantiate things with rev and fwd. Why don't you start by writing a one-off test in the appropriate place? The base template should continue to accept everything, but a template specialization won't need to.
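
The base-template-plus-specialization split Daniel describes can be sketched as follows. The base template stays fully generic, while a specialization narrows the implementation for one type; the names here are made up for illustration.

```cpp
#include <cassert>

// base template: works for any T that supports division
template <typename T>
T half(const T& x) {
  return x / 2;
}

// specialization that only ever sees int; an int-specific
// implementation could go here without touching the base template
template <>
int half<int>(const int& x) {
  return x / 2;
}
```

A one-off test can then exercise just the specialization without the generic instantiation machinery.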


> If so we should be aware that making improvements
> to any distribution is going to be so onerous in this framework
> as to be practically untenable…

I don't see why.

> But let’s just consider
> https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/prob/neg_binomial_cdf.hpp
> for the moment.  If we can’t just define a reverse mode version
> then all of the special functions,

You can --- just put it in rev.

Exactly. At some point, we had the base template and specializations for rev and separate specializations for fwd. We consolidated to make maintenance easier. If there are ways to specialize specifically just for rev that will be really efficient, we can break this apart again.
 
 
> like digamma

A pure double-based version of digamma can go in prim.

> and ibeta and
> the new https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/fun/reg_inc_beta_derivs.hpp
> have to be defined in src/stan/math/rev _and_ src/stan/math/fwd?

If you bring in dependencies to rev it has to go in rev and
same for fwd.

But you'll see that the bulk of the pdfs and pmfs just make
more general definitions.

> Or is the point of prim to define functions that can be autodiffed
> through?

Not at all.  But that's one way they can be done.  You'll see
that the bulk of the pdfs and pmfs are more general and use the
views and constructors.

> If it’s the latter then I’m still having problems with compilation.
> Note that in https://github.com/stan-dev/stan/blob/feature/issue-1426-ibeta_derivs_large_args/src/stan/math/prim/scal/fun/reg_inc_beta_derivs.hpp
> I used std::log — this allows the rev mode distribution to compile
> (tests pass, etc) — but breaks the template substitution of
> higher-order functions.  But if I remove the std namespace then
> the double version doesn’t compile.  What, exactly, is the
> correct namespacing pattern for prim functions?

You want to use the "using" pattern, not explicit imports.

Yes. If instead of your current use, you say:
  using std::log;

and then replace all uses of std::log() with log(), then argument dependent lookup will use std::log() for doubles, stan::agrad::log() (defined in math/rev) for var and stan::agrad::log() (defined in math/fwd) for fvar.
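
The lookup mechanics of this "using" pattern can be sketched with a toy stand-in for stan::agrad::var (the agrad namespace and var type below are illustrative, not the real Stan types):

```cpp
#include <cassert>
#include <cmath>

namespace agrad {
  struct var {
    double val_;
    explicit var(double v) : val_(v) {}
  };
  // found by argument-dependent lookup when log() is called on a var
  inline var log(const var& x) { return var(std::log(x.val_)); }
}

template <typename T>
T apply_log(const T& x) {
  using std::log;  // makes std::log a candidate for plain double
  return log(x);   // unqualified call: ADL adds agrad::log for var
}
```

With `T = double` overload resolution picks `std::log`; with `T = agrad::var`, `std::log` is not viable (no var-to-double conversion), so ADL's `agrad::log` is chosen.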

For reg_inc_beta_derivs, are you sure you wanted this to be templated the way you have it? Should it all be double instead?

Michael Betancourt

Apr 27, 2015, 4:39:55 PM
to stan...@googlegroups.com
So contrary to what Bob said, everything in prim _is_ meant to be
instantiated with all types of autodiff variable? In which case
disabling a distribution for higher-order would require putting a no-op
template specialization in fwd?

Yes. If instead of your current use, you say:
  using std::log;

and then replace all uses of std::log() with log(), then argument dependent lookup will use std::log() for doubles, stan::agrad::log() (defined in math/rev) for var and stan::agrad::log() (defined in math/fwd) for fvar.

But wouldn’t it compile using log by itself and including cmath?
Why doesn’t ADL work for doubles?

For reg_inc_beta_derivs, are you sure you wanted this to be templated the way you have it? Should it all be double instead?

It’s the same thing as the old regularized incomplete beta 
stuff — we shouldn’t be autodiffing through these functions
but if we don’t then we don’t have any fwd implementation
and so it was decided that that was better than nothing.

Bob Carpenter

Apr 27, 2015, 8:55:08 PM
to stan...@googlegroups.com

> On Apr 28, 2015, at 6:39 AM, Michael Betancourt <betan...@gmail.com> wrote:
>
> So contrary to what Bob said, everything in prim _is_ meant to be
> instantiated with all types of autodiff variable?

Let's separate the intent of the general functions in prim
from that of the probability densities.

The probability densities in prim are meant to be instantiated with
double if you just include prim. If you also include the template
specializations in fwd and rev, then you can instantiate with var and
fvar.

Other functions within the prim namespace work differently, with
a double-based implementation (or just using the cmath as in
case of exp and log) and then special fwd and rev implementations.

> In which case
> disabling a distribution for higher-order would require putting a no-op
> template specialization in fwd?

You mean one of the ones that's built using OperandsAndPartials?
That would probably work --- it's a matter of having the disabling
specialization match better by template matching rules, which are
fiddly.
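
Rather than fighting template matching rules with a no-op, one option (a hypothetical sketch, not existing Stan code) is to give the unwanted specialization a deleted definition, so any higher-order use fails at compile time with a clear error. Here fvar is a toy stand-in for a forward-mode type and some_cdf is a made-up name:

```cpp
#include <cassert>

struct fvar {
  double val_, d_;
};

// placeholder double-based implementation
template <typename T>
double some_cdf(const T& y) {
  return 0.5;
}

// any attempt to call some_cdf with an fvar now fails to compile,
// instead of silently producing inaccurate derivatives
template <>
double some_cdf<fvar>(const fvar&) = delete;
```

This trades a silent no-op for an explicit compile-time diagnostic, which fits the goal of surfacing problems to users.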

>> Yes. If instead of your current use, you say:
>> using std::log;
>>
>> and then replace all uses of std::log() with log(), then argument dependent lookup will use std::log() for doubles, stan::agrad::log() (defined in math/rev) for var and stan::agrad::log() (defined in math/fwd) for fvar.
>
> But wouldn’t it compile using log by itself and including cmath?
> Why doesn’t ADL work for doubles?

It only works for classes, not primitive types, which aren't
defined in a namespace. What gets brought in is the namespace
where the class is defined. So even if a function uses var
as an argument, it won't get brought in unless it's included and
the function is defined in namespace agrad.
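
The rule can be seen in a minimal example (all names made up for illustration): unqualified calls search the namespaces associated with the arguments' types, so a class argument brings its namespace along, while a primitive like int or double brings nothing.

```cpp
#include <cassert>

namespace mylib {
  struct Widget { int v; };
  inline int twice(const Widget& w) { return 2 * w.v; }
}

inline int use_widget() {
  mylib::Widget w{21};
  return twice(w);  // OK: ADL finds mylib::twice via Widget
  // twice(21);     // would not compile: int has no associated namespace
}
```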

>> For reg_inc_beta_derivs, are you sure you wanted this to be templated the way you have it? Should it all be double instead?
>
> It’s the same thing as the old regularized incomplete beta
> stuff — we shouldn’t be autodiffing through these functions
> but if we don’t then we don’t have any fwd implementation
> and so it was decided that that was better than nothing.

And did you want to turn this off?

I'd think autodiffing through the functions would be better than
nothing. I understand it's not ideal and also understand that it can
be worse than not having anything if it's too bad.
Have you tested the accuracy of the derivatives?

- Bob

Michael Betancourt

Apr 28, 2015, 4:16:52 AM
to stan...@googlegroups.com
> Let's separate the intent of the general functions in prim
> from that of the probability densities.
>
> The probability densities in prim are meant to be instantiated with
> double if you just include prim. If you also include the template
> specializations in fwd and rev, then you can instantiate with var and
> fvar.
>
> Other functions within the prim namespace work differently, with
> a double-based implementation (or just using the cmath as in
> case of exp and log) and then special fwd and rev implementations.

The different behavior for fun and prob is particularly confusing,
especially with the former requiring rev and fwd specializations and
the latter not.

In particular, I think it’s incredibly misleading to say “prim just
gets instantiated with doubles if you just include prim” when
you’re not always in control of what gets included by the user.
All I wanted to hear was “the prob distributions in prim can
get instantiated with any autodiff type, depending on how they’re
used”.

> It only works for classes, not primitive types, which aren't
> defined in a namespace. What gets brought in is the namespace
> where the class is defined. So even if a function uses var
> as an argument, it won't get brought in unless it's included and
> the function is defined in namespace agrad.

That is helpful.

> I'd think autodiffing through the functions would be better than
> nothing. I understand it's not ideal and also understand that it can
> be worse than not having anything if it's too bad.
> Have you tested the accuracy of the derivatives?

Let’s consider the motivation behind this particular branch. I
was building a model with a colleague that required negative
binomial CDF for very large arguments, arguments large enough
to strain the naive gradient implementation which looped for the
max number of iterations before prematurely terminating at
an inaccurate value. And this was for the gradient — autodiffing
through this would be even worse.

I very much believe that one of the best features of Stan
is that it informs the user of any problems and I’d rather support
this design by not including fragile higher-order derivatives.

Bob Carpenter

Apr 28, 2015, 10:33:36 AM
to stan...@googlegroups.com

> On Apr 28, 2015, at 6:16 PM, Michael Betancourt <betan...@gmail.com> wrote:
>
>> Let's separate the intent of the general functions in prim
>> from that of the probability densities.
>>
>> The probability densities in prim are meant to be instantiated with
>> double if you just include prim. If you also include the template
>> specializations in fwd and rev, then you can instantiate with var and
>> fvar.
>>
>> Other functions within the prim namespace work differently, with
>> a double-based implementation (or just using the cmath as in
>> case of exp and log) and then special fwd and rev implementations.
>
> The different behavior for fun and prob is particularly confusing,

You mean the different way that functions in prim/scal/prob vs. prim/scal/fun
are implemented? We could use the approach in prim/prob for
examples in prim/fun. I don't think we need to keep the stan::prob
namespace vs. stan::math namespace distinction.

And I'm not sure about stan::agrad as a namespace. It was meant to
mean "automatic gradients", so it doesn't even make sense for fvar!

I'd be OK merging all of:

* stan::math
* stan::agrad
* stan::prob

It'd make the argument-dependent lookup easier to manage.

> especially with the former requiring rev and fwd specializing and
> the latter not.

Everything in prim requires rev and fwd specializations of at
least part of their implementations, because nothing in prim
should depend on fwd or rev.

> In particular, I think it’s incredibly misleading to say “prim just
> gets instantiated with doubles if you just include prim” when
> you’re not always in control of what gets included by the user.

The point isn't that prim only gets instantiated with double,
but rather that nothing in prim depends on rev or fwd. We should
be able to include just stan/math/prim and get a double-based
math library that works and never includes any fwd or rev functionality.

Whenever you write a template function in C++, you can't control
how it gets instantiated unless you pile on disable_if/enable_if
conditions. If I write

template <typename T>
T id(T x) { return x; }

it'll be the identity function for any type T a user implements,
as long as it supplies a copy constructor.
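
The enable_if approach Bob mentions can be sketched on that same identity function: the template below only instantiates for floating-point types, so an int argument fails at compile time. The name id_fp is made up for illustration.

```cpp
#include <cassert>
#include <type_traits>

// identity restricted to floating-point T via enable_if (SFINAE);
// id_fp(1) with an int argument would fail to compile
template <typename T,
          typename std::enable_if<std::is_floating_point<T>::value,
                                  int>::type = 0>
T id_fp(T x) {
  return x;
}
```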

> All I wanted to hear was “the prob distributions in prim can
> get instantiated with any autodiff type, depending on how they’re
> used”.

That's true if you add that "and I've included all the template
specializations required for fwd and rev" (of the vector views
and operands/partials structures). If you only include
prim, you won't be able to do that.

Sorry if I was being too literal --- it's the danger of a computer
science upbringing :-)

>> It only works for classes, not primitive types, which aren't
>> defined in a namespace. What gets brought in is the namespace
>> where the class is defined. So even if a function uses var
>> as an argument, it won't get brought in unless it's included and
>> the function is defined in namespace agrad.
>
> That is helpful.
>
>> I'd think autodiffing through the functions would be better than
>> nothing. I understand it's not ideal and also understand that it can
>> be worse than not having anything if it's too bad.
>> Have you tested the accuracy of the derivatives?
>
> Let’s consider the motivation behind this particular branch. I
> was building a model with a colleague that required negative
> binomial CDF for very large arguments, arguments large enough
> to strain the naive gradient implementation which looped for the
> max number of iterations before prematurely terminating at
> an inaccurate value. And this was for the gradient — autodiffing
> through this would be even worse.
>
> I very much believe that one of the best features of Stan
> is that it informs the user of any problems and I’d rather support
> this design by not including fragile higher-order derivatives.

I think it should be a judgment call about how fragile they are.
In a very real sense, all of our functions are fragile because floating
point's fragile. Subtraction's a disaster! But even exp() only works with
inputs with absolute values less than 800. That seems very fragile to me,
but it doesn't mean we don't include it. And inverse() for matrices is fragile
if the input matrix is not well conditioned, but we still include it.
(Someone at my talk tonight, who was older than me, told me that his father,
who is now 90, drilled into his head when he was a wee lad that one should
never invert a matrix --- so at least part of my talk resonated!)

So I think it comes down to where to draw the line. That's what I
meant by "be worse than not having anything if it's [the behavior] too
bad". If it's too ill-behaved then I agree we shouldn't include it.

- Bob
