Jacobian needed?


Daniel Lakeland

Aug 13, 2015, 6:51:26 PM
to Stan users mailing list
Stan often spits out warnings about Jacobian corrections, and I can't figure out logically whether they're needed for my purposes. Let me give a simple example.

Suppose I have some data x, and some data y.

I know that if I shift and re-scale x by some unknown amounts, I can transform the result through some nonlinear transformation and get y to within some amount of error. For example:

y = exp( (x-a)/b) + error

Suppose I have some other information about a, and b. Either prior, or some kind of measurements that I can relate to a and b.

a ~ normal(a_prior,10);
b ~ normal(b_prior,10);
a_data ~ normal(a,1);
b_data ~ normal(b,1);


Now I'd like to specify the information I have about the relationship between y and x. I could do:

exp(-(x-a)/b) ~ normal(y,ysigma);

And of course, Stan is going to spit out warnings about the need for a Jacobian... But is it true that I need a Jacobian correction? This isn't information about a or b, it's information about exp(-(x-a)/b), whatever that combination turns out to be, and it seems incorrect to alter the distribution via a Jacobian correction because that would imply DIFFERENT information from what I have, no?
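Putting the pieces together, the whole program would look something like this (just a sketch; the data block is my guess at how the knowns get passed in, and N, a_prior, b_prior, a_data, b_data, and ysigma are all assumed to be data):

  data {
    int<lower=1> N;
    vector[N] x;
    vector[N] y;
    real a_prior;
    real b_prior;
    real a_data;
    real b_data;
    real<lower=0> ysigma;
  }
  parameters {
    real a;
    real b;
  }
  model {
    a ~ normal(a_prior, 10);
    b ~ normal(b_prior, 10);
    a_data ~ normal(a, 1);
    b_data ~ normal(b, 1);
    exp(-(x - a) / b) ~ normal(y, ysigma);  // this is the line Stan warns about
  }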

Daniel Lakeland

Aug 13, 2015, 7:00:51 PM
to Stan users mailing list
I should say, obviously if the posterior for a,b is delta-function-like then the transform doesn't make much difference, since every differentiable transform is linear on a small enough scale of variation, so let's assume the posterior isn't super narrow for a,b in this example.



Michael Betancourt

Aug 13, 2015, 7:32:07 PM
to stan-...@googlegroups.com
Yes, the statement informs the variable z=exp(-(x-a)/b) but you’re _not
defining a distribution on z_, you’re defining a distribution on (a, b).  The
Jacobian is exactly the map that takes information about z and transforms
it to information about the (a, b) that we really care about.


Bob Carpenter

Aug 13, 2015, 7:58:36 PM
to stan-...@googlegroups.com

> On Aug 13, 2015, at 6:51 PM, Daniel Lakeland <dlak...@street-artists.org> wrote:
>
> Stan often spits out warnings about jacobian corrections, and I can't figure out logically whether they're needed for my purposes. Let me give some kind of simple example.

The short answer is that if you need it to define the density,
you need to code it in Stan. We aren't doing anything differently
than all the stats textbooks out there.

The key is that Stan's density is defined in the model block over
the constrained variables defined in the parameters block. We handle
all the Jacobians from the unconstrained space, so you only need to
worry about defining a density on the variables defined in the
parameters block.

Each

a ~ foo(theta)

is equivalent to adding a term foo_log(a, theta) to the log density.

> Suppose I have some data x, and some data y.

>
> i know that if I shift and re-scale x by some unknown amounts, that I can transform the result through some nonlinear transformation, and get y to within some amount of error. For example:
>
> y = exp( (x-a)/b) + error
>
> Suppose I have some other information about a, and b. Either prior, or some kind of measurements that I can relate to a and b.
>
> a ~ normal(a_prior,10);
> b ~ normal(b_prior,10);
> a_data ~ normal(a,1);
> b_data ~ normal(b,1);
>
>
> now I'd like to specify the information I have about the relationship between y and x, i could do.
>
> exp(-(x-a)/b) ~ normal(y,ysigma);
>

This last statement is equivalent to

y ~ normal(exp((a - x) / b), ysigma);

which is the usual way to write the non-linear regression you provided above. Written
this way, you shouldn't need a Jacobian. It depends on what density you want to define.
(I've also eliminated a negation for efficiency.)

The key thing is to understand the density you're defining and make sure
it's the one you want.

- Bob

Daniel Lakeland

Aug 13, 2015, 8:05:43 PM
to Stan users mailing list
Hmm... that statement illuminated my issue a little better. Suppose that I don't care about a,b; I only care about z!!

Think of this as a noise-filtering issue: I have y, which is noisy measurements of z, and I'd like to figure out z. I happen to know that z has a certain form with some unknown parameters involved, and I do have some information about those parameters, but I don't really care what they are, and I don't have any strong knowledge of the probability distribution over a,b.


Michael Betancourt

Aug 13, 2015, 8:08:24 PM
to stan-...@googlegroups.com
Then you’d define z as a parameter.

On Aug 13, 2015, at 5:05 PM, Daniel Lakeland <dlak...@street-artists.org> wrote:

Hmm... that statement illuminated my issue a little better. Suppose that I don't care about a,b I only care about z!!

Think of this as a noise-filtering issue, i have y which is noisy measurements of z, and I'd like to figure out z, and I happen to know that z has a certain form with some unknown parameters involved, but I do have some information about the unknown parameters, but I don't really care what the parameters are and I don't really have any strong knowledge of the probability distribution over a,b.



Daniel Lakeland

Aug 13, 2015, 8:25:32 PM
to Stan users mailing list

OK, so z is a vector, and I want to impose the constraint that z_i = exp(-(x_i-a)/b) exactly for some unknown a,b, and I want Stan to find a,b which are consistent with my knowledge that

z ~ normal(y,ysigma)

and the other knowledge that I had about a,b (see above), in which a,b appear without transformation. But I don't care about a,b per se; they are nuisance parameters. What I care about is z, and/or values of the function at other x values (prediction), etc.

In some sense, you might say that the function is defined by "a,b" and so I should care about a,b and define the proper distribution over a,b but this isn't always obviously the case.

Daniel Lakeland

Aug 13, 2015, 8:27:46 PM
to Stan users mailing list


On Thursday, August 13, 2015 at 4:58:36 PM UTC-7, Bob Carpenter wrote:


> now I'd like to specify the information I have about the relationship between y and x, i could do.
>
> exp(-(x-a)/b) ~ normal(y,ysigma);
>

This last statement is equivalent to

  y ~ normal(exp((a - x) / b), ysigma);

which is the usual way to write the non-linear regression you provided above.  Written
this way, you shouldn't need a Jacobian.  It depends on what density you want to define.
(I've also eliminated a negation for efficiency.)


I don't need a Jacobian because Stan figures it out and puts it in for me, or I don't need a Jacobian because of some other reason?

 
The key thing is to understand the density you're defining and make sure
it's the one you want.

If I knew that, I wouldn't have to ask!!! :-) Partly, the issue is I'm not entirely sure what Stan does and doesn't do under the hood, so your answer to the above would bring me closer to that understanding.


Daniel Lakeland

Aug 13, 2015, 8:51:11 PM
to Stan users mailing list

To give you a sense of how this might arise, suppose that I want to define a 2D Gaussian process over a square and then transform that Gaussian process through a fairly complex nonlinear function of a particular form and compare it to some data. I expect that under the transformation, the errors between data and the transformed process have some known distribution, and I want samples of the TRANSFORMED Gaussian process; the underlying process is of no interest by itself.

This situation is obviously a lot more complex than the simple a,b one, and the Jacobian corrections certainly seem ... odd ...

Daniel Lakeland

Aug 13, 2015, 9:18:50 PM
to Stan users mailing list
Or a way that the original example model could arise:

You have a very fast clock circuit which drives some kind of measurement device. Because the clock is so fast it's hard to know precisely how fast, but you have measurements of brightness at a bunch of time values x given in clock ticks, and for a while after some event (time a) the brightness increases exponentially. Unfortunately it's a noisy brightness measurement and you have gaps in your data.

You know that there exists some time-scale on which you can measure x such that an exponential curve goes right through the data. What you want to know is how bright was it during some gap in your data? You want samples of z=exp((x-a)/b) during the gap and also filtered y values (z samples) for the x values you have y values for.

You don't care about a or b, only samples of z. You know that the z samples come in the form exp(-(x-a)/b) for some a,b about which you have only diffuse prior information (and maybe some independent measurements), and that the a,b should make z have an approximately normal distribution relative to y.
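For the prediction part, something like this in generated quantities would give samples of z in the gap (x_gap and N_gap are just hypothetical names for the missing time values):

  generated quantities {
    vector[N_gap] z_gap;
    for (i in 1:N_gap)
      z_gap[i] <- exp((x_gap[i] - a) / b);  // predicted brightness at the missing times
  }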


Jonah

Aug 13, 2015, 10:14:35 PM
to Stan users mailing list


On Thursday, August 13, 2015 at 8:27:46 PM UTC-4, Daniel Lakeland wrote:
On Thursday, August 13, 2015 at 4:58:36 PM UTC-7, Bob Carpenter wrote:
This last statement is equivalent to

  y ~ normal(exp((a - x) / b), ysigma);

which is the usual way to write the non-linear regression you provided above.  Written
this way, you shouldn't need a Jacobian.  It depends on what density you want to define.
(I've also eliminated a negation for efficiency.)


I don't need a Jacobian because stan figures it out and puts it in for me, or I don't need a Jacobian because of some other reason?

Both, kind of. Stan is automatically doing the needed Jacobians to transform any constrained parameters (declared in parameters block) to the unconstrained space for sampling. But when you have exp((a - x) / b) on the left side of the sampling statement it requires an additional Jacobian since you're placing a distribution on a non-linear transformation of parameters. When you write it the way Bob did there's no non-linear transformation on the left side of the sampling statement and so no additional Jacobian adjustment required. It works out to the same thing. 

Bob Carpenter

Aug 13, 2015, 10:40:34 PM
to stan-...@googlegroups.com

> On Aug 13, 2015, at 8:27 PM, Daniel Lakeland <dlak...@street-artists.org> wrote:
>
>
>
> On Thursday, August 13, 2015 at 4:58:36 PM UTC-7, Bob Carpenter wrote:
> ...

> I don't need a Jacobian because stan figures it out and puts it in for me, or I don't need a Jacobian because of some other reason?

It's a matter of which density you're trying to define.

Stan doesn't automatically include any Jacobians other than ones
implied by the constraints on parameters.

> The key thing is to understand the density you're defining and make sure
> it's the one you want.
>
> If I knew that, I wouldn't have to ask!!! :-) Partly, the issue is I'm not entirely sure what Stan does and doesn't do under the hood, so your answer to the above would bring me closer to that understanding.

The manual describes exactly what Stan does under the hood.

In short, it transforms the constrained parameters to be unconstrained
as described in the chapter on transforms near the end of the manual,
and applies the appropriate Jacobian correction so that the priors
are by default all uniform on the *constrained* space.

After that, it's up to you to define the right density on the
constrained parameters.

- Bob

Bob Carpenter

Aug 13, 2015, 10:42:57 PM
to stan-...@googlegroups.com
It's not a matter of what you care or don't care about.

If you have a random variable X and some non-linear function
f and say f(X) ~ foo(theta), then you need to account for
the change of variables implied by f.

If you instead have a change of parameterization, and have
a variable Y and a parameter phi and a non-linear function g,
you don't need a Jacobian adjustment to set Y ~ bar(g(phi)).
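For instance (a sketch, with a positive parameter X, taking f = log and g = exp):

  // change of variables: the density is placed on log(X),
  // so add log | d/dX log(X) | = -log(X) to the log density
  log(X) ~ normal(0, 1);
  increment_log_prob(-log(X));

  // change of parameterization: the density stays on Y,
  // so no Jacobian adjustment is needed
  Y ~ normal(exp(phi), 1);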

- Bob

Ben Goodrich

Aug 13, 2015, 10:57:08 PM
to Stan users mailing list
On Thursday, August 13, 2015 at 6:51:26 PM UTC-4, Daniel Lakeland wrote:
Stan often spits out warnings about jacobian corrections, and I can't figure out logically whether they're needed for my purposes.

The short answer is "If you get a parser warning about a Jacobian correction, you need to do the Jacobian correction unless you are sure that the (logarithm of the absolute value of the) Jacobian determinant in question is a constant". Conversely, if you don't get a parser warning about a Jacobian correction or you are sure that the (logarithm of the absolute value of the) Jacobian determinant in question is a constant, then you don't need to do a Jacobian correction unless you need a normalized log-density for some reason.

Ben

Daniel Lakeland

Aug 13, 2015, 11:34:11 PM
to stan-...@googlegroups.com

Obviously, you're right that it matters what pdf I'm trying to create, and that's what I'm getting at. The two programs encode different models, but it's not entirely clear that the model WITH the Jacobian *is the model* that I want. Furthermore, it's not so clear that a nonlinear transform on the left-hand side means "needs a Jacobian" and no nonlinear transform means "no Jacobian."

I guess I'm trying to figure out how to understand the meaning of the model with and without the Jacobian so I can figure out which one I want! :-)

I'm particularly puzzled by the bit about putting the nonlinear transform on the right.

First off, normal_log(a,b,c) is mathematically symmetric with normal_log(b,a,c), since they're both -((a-b)/c)^2/2 (up to terms that don't involve a or b), right?

so I don't see how

exp((x-a)/b) ~ normal(y,ysigma);

leads to different mathematics from

y ~ normal(exp((x-a)/b),ysigma);

where I now "don't need a Jacobian"

Also, as Michael pointed out, if I really "care about inference on z" then I should make it a parameter. But it's a parameter on which I'm requiring a certain structure. So I could do it in Stan like this:
parameters {
vector[100] z;
}
...

z ~ normal(exp((x-a)/b), 1e-36);
y  ~ normal(z, ysigma);

there's no "nonlinear transformation on the left hand side" so "I don't need a Jacobian", but z is with probability 1 so close to exp((x-a)/b) that it doesn't make a difference, and this is terribly hard to sample (my guess). Nevertheless, suppose what I want is the limit of this model as the z stddev goes to zero. How do I code it in Stan?

Bob Carpenter

Aug 14, 2015, 12:22:25 AM
to stan-...@googlegroups.com

> On Aug 13, 2015, at 11:34 PM, Daniel Lakeland <dlak...@street-artists.org> wrote:
>
...

> How do I code it in stan?

The reason we're all confused is that we always start with a density,
then figure out how to code it in Stan. You're starting with Stan code
and we're confused about what the density is that you want to code.

I'd strongly suggest working through sections 15.2 (Reparameterization) and
15.3 (Change of Variables) of the manual. 15.2 shows how to reparameterize
the Beta distribution, where no Jacobian is required. 15.3 shows how to define
lognormal in terms of normal, where a Jacobian is required.

They do different things. Reparameterization transforms the parameters of
a distribution (section 15.2), whereas a change of variables transforms
the outcome variable (section 15.3).
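From memory (so check the manual for the exact code), the 15.2 example reparameterizes the beta in terms of a mean and total count, with no Jacobian because the density is still on theta:

  parameters {
    real<lower=0,upper=1> phi;    // mean
    real<lower=0> lambda;         // total count
    real<lower=0,upper=1> theta;
  }
  model {
    theta ~ beta(lambda * phi, lambda * (1 - phi));
  }

The 15.3 example defines the lognormal by putting a normal density on log(y) and adding the log Jacobian term, log | d/dy log(y) | = -log(y).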

- Bob

Michael Betancourt

Aug 14, 2015, 12:56:24 AM
to stan-...@googlegroups.com
No, they are the same model.  I’m not going to get into a big
diatribe about measures vs densities here, but you’re confusing
some fundamental concepts in probability theory.  Perhaps it
will help you get started if you recognize that the fundamental
object is a measure,

p(x) dx

where p(x) is the density.  If you change parameters then you
get a new density and a new volume, but the measure has to
stay the same,

p(x) dx = p(y) dy

or

p(x) = p(y) | dy / dx | = p(y(x)) | dy/dx (x) |

where | dy/dx (x) | is exactly the Jacobian.

On Aug 13, 2015, at 8:34 PM, Daniel Lakeland <dlak...@street-artists.org> wrote:


Obviously, you're right that it matters what pdf I'm trying to create, and that's what I'm getting at. The two programs encode different models, but it's not entirely clear that the model WITH the jacobian *is the model* that I want. Furthermore, it's not so clear that "a nonlinear transform on the left hand side == needs a jacobian" and "no nonlinear transform" means no jacobian.

I guess I'm trying to figure out how to understand the meaning of the model with and without the jacobian so I can figure out which one I want! :-)

I'm particularly puzzled by the bit about putting the nonlinear transform on the right.

First off, normal_log(a,b,c) is mathematically symmetric with normal_log(b,a,c) since they're both -((a-b)/c)^2/2 right?

so I don't see how

exp((x-a)/b) ~ normal(y,ysigma);

leads to different mathematics from

y ~ normal(exp((x-a)/b),ysigma);

where I now "don't need a jacobian"

also, as Michael pointed out, if I really "care about inference on z" then I should make it a parameter. But it's a parameter that I'm requiring a certain structure to. So I could do it in stan like this:
parameter{
real z;

}
...

z ~ normal(exp((x-a)/b, 1e-36);
y  ~ normal(z, ysigma);

there's no "nonlinear transformation on the left hand side" so "I don't need a jacobian", but z is with probability 1 so close to exp((x-a)/b) that it doesn't make a difference and this is terribly hard to sample (my guess). Nevertheless, suppose what I want is the limit of this model as the z stddev goes to zero. How do I code it in stan?

Daniel Lakeland

Aug 14, 2015, 11:46:25 AM
to stan-...@googlegroups.com

I apologize for any frustrations here. Part of my frustration is that I'm used to being able to translate math into code more freely. Stan is a very high-level language, so it's sometimes worrisome that small differences in expressions might lead to large differences in results.

Going back to the exponential model, I'll explain the math I'm trying to achieve. y and x are data vectors, a and b are parameters, and I need to specify a joint density:

p(y,x,a,b);
I factorize it as:

p(y|x,a,b) p(x|a,b) p(a)p(b) (since a and b are independent in my model)

Now, p(y|x,a,b) = product(i, normal(y[i], exp((x[i]-a)/b), sigmay)) /* EDITED TO FIX ORIGINAL TYPO */
p(x|a,b) = 1 (the probability of the data x is just 1, the data just *is*)
p(a) = normal(a*,sigma_a); /* some kind of prior*/
p(b) = normal(b*,sigma_b); /* some kind of prior*/

let's ignore the "data on a" and "data on b" since they are a red herring I think.

Now, I believe that Bob has recommended I code this in Stan as:

a ~ normal(amean,asigma);
b ~ normal(bmean,bsigma);
y ~ normal(exp((x-a)/b),ysigma);


What I don't understand is why, mathematically, I would get any different answer if I coded it as:

a~normal(amean,asigma);
b~normal(bmean,bsigma);
exp((x-a)/b) ~ normal(y,ysigma);

or even

exp((x-a)/b) - y ~ normal(0,ysigma);

or especially (since I'd like to sample z)

transformed parameters {
vector[100] z;

z <- exp((x-a)/b);
}

model{
...
z ~ normal(y,ysigma);
}

especially because normal_log(foo,bar,baz) == normal_log(bar,foo,baz) (doesn't it?).

Now, perhaps you feel that this is an abuse of notation, or bad Stan programming practice, but is it going to give me the wrong answer relative to the model expressed in mathematics at the top?




Daniel Lakeland

Aug 14, 2015, 2:46:00 PM
to Stan users mailing list


On Thursday, August 13, 2015 at 9:56:24 PM UTC-7, Michael Betancourt wrote:
 
 Perhaps it
will help you get started if you recognize that the fundamental
object is a measure,

p(x) dx

where p(x) is the density.  If you change parameters then you
get a new density and a new volume, but the measure has to
stay the same,

p(x) dx = p(y) dy

or

p(x) = p(y) | dy / dx | = p(y(x)) | dy/dx (x) |

where | dy/dx (x) | is exactly the Jacobian.

If you have a precise mathematical model, and you do a change of parameters, then yes you keep the measure the same this way.

But, I'm not starting with a precise model FOO and doing a change of parameters to model BAR and wanting BAR to have the same mathematical properties as FOO!

I'm starting with a model BAR and expressing it by interchanging the order of arguments in normal_log, and wondering why both Bob and Stan think that this means I need to give it a Jacobian whereas I don't if I change the order of arguments back!

So there are two questions, I guess. First off, why does Bob say that

exp((x[i]-a)/b) ~ normal(y,sigmay);

"requires a Jacobian correction" but

y[i] ~ normal(exp((x[i]-a)/b),sigmay)

"does not require a Jacobian" even though normal_log is symmetric in its first two arguments

Second, when my parameters are invented fictions anyway (say, rather than a,b we have something like the 2D Gaussian process), and I'm going to transform them through some complicated nonlinear transform and want to know what the data says about the transformed parameter, can't I look at this as the presence or lack of the Jacobian really altering what my prior over the parameters is? But I don't have a "true" prior over the parameters anyway, so the lack of the Jacobian doesn't make my model "wrong" per se; it makes it "something that isn't explicit there in the code".

In other words, Stan's message seems to insist that there is a model FOO with which I need to maintain equivalence, when such a thing isn't necessarily the case.

As Bob says: it all depends on which density you're trying to compute!

Bob Carpenter

Aug 14, 2015, 2:50:34 PM
to stan-...@googlegroups.com

> On Aug 14, 2015, at 2:45 PM, Daniel Lakeland <dlak...@street-artists.org> wrote:

...

> So, there's two questions I guess. first off, is why does Bob say that
>
> exp((x[i]-a)/b) ~ normal(y,sigmay);
>
> "requires a jacobian correction" but
>
> y[i] ~ normal(exp((x[i]-a)/b),sigmay)
>
> "does not require a jacobian" even though normal_log is symmetric in the first two arguments

The sampling notation, as used in Bayesian stats papers, implies
different models, because the ~ is read as describing a generative
process where the right-hand side parameters are known and the left
is generated.

But that is NOT how Stan works. In Stan, these two expressions produce
identical results up to names of parameters in error messages.

I'll reiterate that the thing to do is understand how Stan
statements get translated to log densities.

The key here is that

y ~ foo(theta);

is translated as

increment_log_prob(foo_log(y, theta));

This is just syntactic sugar other than for the fact that the sampling
notation with ~ drops constant terms.

After that, it's up to you to apply whatever Jacobians you need.

- Bob

Bob Carpenter

Aug 14, 2015, 2:53:40 PM
to stan-...@googlegroups.com

> On Aug 14, 2015, at 11:46 AM, Daniel Lakeland <dlak...@street-artists.org> wrote:
>
>
> I apologize for any frustrations here. Part of my frustrations are that I'm used to being able to translate math into code more freely. Stan is a very high level language, so it's sometimes worrisome that small differences in expressions might lead to large differences in results.
>
> Going back to the exponential model, I'll explain the math I'm trying to achieve. y and x are data vectors, a, and b are parameters, I need to specify a joint density:
>
> p(y,x,a,b);
> I factorize it as:
>
> p(y|x,a,b) p(x|a,b) p(a)p(b) (since a and b are independent in my model)
>
> now, p(y|x,a,b) = product(i, normal(y[i]-exp((x[i]-a)/b), sigmay)

There's a missing argument for normal here. Did you mean normal(y[i] | -exp(...), sigma)?

> p(x|a,b) = 1 (the probability of the data x is just 1, the data just *is*)

I take it x is a predictor here. In which case, you can just code p(y, a, b | x)
and forget about x.

> p(a) = normal(a*,sigma_a); /* some kind of prior*/
> p(b) = normal(b*,sigma_b); /* some kind of prior*/

Is everything else a constant --- I don't see sigmay, a*, b*, sigma_a, and sigma_b
in your probability function.

- Bob

Daniel Lakeland

Aug 14, 2015, 3:00:02 PM
to Stan users mailing list


On Friday, August 14, 2015 at 11:50:34 AM UTC-7, Bob Carpenter wrote:

....
 
But that is NOT how Stan works.   In Stan, these two expressions produce
identical  results up to names of parameters in error messages.

I'll reiterate that the thing to do is understand how Stan
statements get translated to log densities.

The key here is that

  y ~ foo(theta);

is translated as

  increment_log_prob(foo_log(y, theta));

This is just syntactic sugar other than for the fact that the sampling
notation with ~ drops constant terms.

After that, it's up to you to apply whatever Jacobians you need.


Great! So since normal_log is symmetric in its first two arguments, whichever way I code it:

exp((x[i]-a)/b) ~ normal(y[i],ysigma);

or

y[i] ~ normal(exp((x[i]-a)/b), ysigma);

or

exp((x[i]-a)/b)-y[i] ~ normal(0,ysigma);

all give the same result!


Daniel Lakeland

Aug 14, 2015, 3:03:44 PM
to Stan users mailing list

Yes, let's just assume a*, b*, the sigmas, etc. are all known constants. I'm trying to pare the issue down to the bone. Also, I edited the post here on Google Groups to fix the typo you mention (missing argument to normal).


Bob Carpenter

Aug 14, 2015, 3:19:25 PM
to stan-...@googlegroups.com
OK, now assuming all those asigma and sigma_a and a* and amean line up,
then yes, you can code it the way you suggest in Stan.

- Bob

Daniel Lakeland

Aug 15, 2015, 6:17:27 PM
to Stan users mailing list
Bob, thanks for your help here. I coded up my example in R with rstan and ran the model. I use z as a transformed parameter (a vector), and tried both:

z ~ normal(y,2)

and

y ~ normal(z,2)

and the same random seed, and verified empirically that I get exactly the same samples (once I sort them in order, since rstan seems to randomize the order when it extracts them).

This is as it should be.
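The relevant part of the program looks something like this (a sketch; N, x, and the priors on a and b are as before):

  transformed parameters {
    vector[N] z;
    for (i in 1:N)
      z[i] <- exp((x[i] - a) / b);
  }
  model {
    a ~ normal(a_prior, 10);
    b ~ normal(b_prior, 10);
    y ~ normal(z, 2);    // or z ~ normal(y, 2), which gives identical samples
  }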

Then I sat down and tried to figure out what the Stan error message even means. Remember, z is a vector with z[i] = exp((x[i]-a)/b).

when I write

y ~ normal(z,2)

Stan is perfectly happy, but

z ~ normal(y,2)

Stan spits out the error message:

"Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If so, you need to call increment_log_prob() with the log absolute determinant of the Jacobian of the transform."

So I sat down to blindly carry out this task just to see what happened... and immediately came up against the fact that the transform takes a,b to a scalar. So the Jacobian of the transform is a vector, and the determinant of a vector is not defined.

So, what possible mathematical meaning can this Stan warning message have in this specific context?

I think this topic is important enough that I'd like to offer to write some couple of paragraphs on this topic for the manual, if we can come up with a consensus about what to say for this type of case.

In particular, in my mind I was thinking of this as defining

p(y | a,b,x)

which as Michael Betancourt points out, should be thought of as a measure p(y|a,b,x) dy, but y is an observed value!
However, I wrote it in Stan as if it were:

p(z | y,a,b,x)

in which case it's actually meaningless mathematically, since p(z|a,b,x) = delta_function(z - exp((x-a)/b)) in my model, not what I told Stan.

So... should we put something in the manual about it being bad coding practice to write data likelihoods with the unknowns on the left side and the knowns on the right side? And that it will trigger warnings about Jacobians, but that they are actually spurious in that case?


Bob Carpenter

Aug 15, 2015, 10:03:14 PM
to stan-...@googlegroups.com
Jacobians are always square matrices when you do changes of variables
because you map N variables to N different variables. This is just
basic calculus. See, for example:

http://mathworld.wolfram.com/ChangeofVariablesTheorem.html

https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant

In the Wikipedia article, look at the change of variables examples 5.2 and 5.3.
I'd point to the change of variables page, but it's all stated in
terms of differentiable manifolds! The manifold perspective does tell
you why it's always a square matrix, but that's a little further than
we need to go in our manual.

It's too hard (technically impossible given Stan's Turing
completeness) to figure out if a Jacobian adjustment is required. So
Stan spits out warnings whenever it can't easily deduce that the variable
transform is linear.

Statistics is about treating observations as random! When we write
a linear regression y[i] = a + b * x[i] + epsilon[i], we're treating
y as if it might have been otherwise given x. The randomness here is
in the epsilon[i] ~ normal(0, sigma) term. So it's not about what goes
on the left-hand or right-hand side of a sampling statement.

The key to understanding Stan is this:

*** the sampling statement in Stan is syntactic shorthand for
incrementing the log probability ***

That's why a ~ normal(b, c) and b ~ normal(a, c) are synonymous.
We just don't want to flag every increment_log_prob() statement as
needing a Jacobian.

And it's not about data vs. parameters, it's about which density you're
trying to define. The lognormal is still
the lognormal (which includes a Jacobian adjustment to the normal),
even if all the values are known (outcome, location, and scale).
Same with inverse gamma compared to gamma.
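For example (a sketch, assuming y > 0), the inverse gamma can be built from the gamma the same way:

  1 / y ~ gamma(alpha, beta);
  increment_log_prob(-2 * log(y));  // log | d/dy (1/y) | = -2 * log(y)

which matches y ~ inv_gamma(alpha, beta) up to a constant; and if y is data, that extra term is itself constant and can be dropped.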

The reason we treat a ~ normal(b, c) differently than b ~ normal(a, c)
even though they compile to the same code is that the interpretation
in conventional Bayesian modeling is different. We think of the
variable on the left of the ~ as being generated from the parameters
on the right of the ~. But this is just convention for writing models
in papers using vague notation. Or writing models in BUGS, where the
two are actually different in graphical modeling terms because of
the direction of the arrows.

- Bob

Michael Betancourt

Aug 16, 2015, 9:58:14 PM
to stan-...@googlegroups.com
Let me be very, very blunt. If it’s not obvious why you can
ignore the Jacobian here then you shouldn’t risk it. Densities
are subtle and vulnerable to abuse — so don’t abuse them!

In general we are trying to construct a probability distribution
which assigns probability to neighborhoods of the target space,
\Omega,

p(A) \in [0, 1] for all A \subset \Omega.

This is a pretty abstract idea, so what we typically do is define a
parameterization of \Omega, i.e we say that \Omega looks like
R^{n} and we select n coordinate functions, x, to specify any point
in \Omega. Then we can decompose our probability distribution
into a probability density and a Lebesgue measure,

p = p(x) d^{n} x

which can be read as “probability mass equals probability density
times volume”. The probability of a neighborhood is then given
by

p(A) = \int_{A} p(x) d^{n} x.

Okay, now one of the biggest confusions with densities is that
they’re not unique. We can choose any other parameterization,
y, to give a different decomposition

p = p(y) d^{n} y
p(A) = \int_{A} p(y) d^{n} y.

Now even though p(x) and p(y) might be very different they
have to be related to each other — in particular the integrals
have to be the same for any neighborhood,

\int_{A} p(x) d^{n} x = \int_{A} p(y) d^{n} y

which implies that

p(x) d^{n} x = p(y) d^{n} y

or

p(x) = p(y(x)) | d^{n}y / d^{n} x |

where | d^{n}y / d^{n} x | is the Jacobian. So if you’re given
a distribution for y and you want to know the distribution for
x then you sub in y(x) and add the corresponding Jacobian.

In Stan this manifests whenever we are specifying a prior.

theta ~ distribution_name(hyperparameters)

is fine, but

f(theta) ~ distribution_name(hyperparameters)

is assigning a distribution to f(theta) and requires the Jacobian.

But that would be too easy in and of itself. Because in Bayes' Theorem
we specify a posterior with two terms — the prior and the likelihood.
The prior is a density and so behaves as above, but the
likelihood is a different object. The likelihood is a conditional
probability distribution over data, and when we plug in a particular
measurement it's just a function of the parameters.
That means likelihood terms _don’t_ get Jacobians.

Usually this is pretty clear, because by definition likelihoods define
how the data are sampled and we’d always write

data ~ distribution_name(theta, hyperparameters).

Because theta is on the right-hand side writing

data ~ distribution_name(f(theta), hyperparameters)

wouldn’t throw a Jacobian warning. Similarly, because data don’t
trigger Jacobian warnings

f(data) ~ distribution_name(theta, hyperparameters)

would also be fine.

What’s happening here is that Daniel Lakeland is abusing the
symmetry of the Gaussian density to write

data ~ normal(f(theta), hyperparameters)

as

f(theta) ~ normal(data, hyperparameters)

The latter doesn’t really make sense and abuses the intended
meaning of the ~ notation (which ultimately is allowed only because
the ~ notation is somewhat ill-defined in the first place). For any
other distribution this switch wouldn’t make sense and not needing
the Jacobian would be obvious.

So

1) If you don’t know probability theory then don’t try to be a probability hero.
2) When specifying priors keep the parameters on the left-hand side.
3) When specifying likelihoods keep the data on the left-hand side
and parameters on the right-hand side.
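In Stan terms, rules 2 and 3 look something like this (a sketch, with data y and parameter theta):

  theta ~ normal(0, 1);            // prior: parameter on the left, untransformed
  y ~ normal(exp(theta), sigma);   // likelihood: data on the left, transformed
                                   // parameter on the right, no Jacobian needed
  // exp(theta) ~ normal(0, 1);    // a density on exp(theta): would also need
  // increment_log_prob(theta);    // the Jacobian, log | d exp(theta)/d theta | = theta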

Bob Carpenter

Aug 16, 2015, 10:58:56 PM
to stan-...@googlegroups.com
Thanks, Michael. I still remember the first time
you explained this to me and the whole Jacobian thing
finally clicked.

For everyone else's benefit:

The data-generating process is complicated a bit
by transforms. For example, I can generate a lognormal
distributed variable y by generating a normal
variable x ~ normal(0, 1) and defining y = exp(x). The
other way around is a bit trickier.

In Stan, if y is some positive-constrained data, I can write

log(y) ~ normal(0, 1)

and that will ensure y gets a lognormal distribution. But
it might not be for reasons people expect.

If I want to know what the density of y is, as opposed
to log(y), then I need the Jacobian correction to the normal
density. That is, I want to write

y ~ lognormal(0, 1),

where

lognormal(y | 0, 1) = normal(log(y) | 0, 1) * |d/dy log(y)|.

Even if y is "data". But, when I actually look at the Jacobian,

|d/dy log(y)| = | 1 / y |,

and recall that y is data, then I can drop the Jacobian because
for MCMC and optimization, we only need to define the density up
to a proportion. At least that's my understanding of why we can get
away with writing just

log(y) ~ normal(0, 1)

instead of

y ~ lognormal(0, 1).

In Stan, if I write y ~ lognormal(0, 1), what will happen under
the hood is that the constant term will get dropped because we
don't need it for optimization or MCMC.
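In code, with y data and mu and sigma parameters, the two forms come out the same up to that constant:

  log(y) ~ normal(mu, sigma);   // drops the Jacobian term log(1/y), constant since y is data
  // y ~ lognormal(mu, sigma);  // same density with the 1/y factor included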

- Bob

P.S. Could you teach a probability theory class I can attend?
And maybe use that awesome math writing style from Baez's intro
to manifolds in his knots and gauge theory book? I've never taken a
probability or stats class, but the problem I find with books I've
perused on the topic is that they lose the forest for the trees.
Or maybe I'll just take the Gelman approach and draft such a book
myself, put your name on it, and let you correct all my misconceptions.

Bill Harris

Aug 16, 2015, 11:06:56 PM
to stan-...@googlegroups.com
... or can you recommend a good book / monograph / article on probability theory that supports model development?  My introduction years ago was The Theory of Applied Probability by Richard Dubes (1968). I'm certainly content to refer to it when I detect I need refreshing, but you may have significantly better suggestions.

Thanks,

Bill 

Bill Harris

Aug 16, 2015, 11:08:55 PM
to stan-...@googlegroups.com
On Sun, Aug 16, 2015 at 7:58 PM, Bob Carpenter <ca...@alias-i.com> wrote:

Or maybe I'll just take the Gelman approach and draft such a book
myself, put your name on it, and let you correct all my misconceptions.


Cunningham's law to the fore!

Bill 

Daniel Lakeland

Aug 16, 2015, 11:19:56 PM
to Stan users mailing list


 
But that would be too easy in of itself.  Because in Bayes Theorem
we specify a posterior with two terms — the prior and the likelihood.
The prior behaves is a density and so behaves as above, but the
likelihood is a different object.  The likelihood is a conditional
probability distribution over data and when we plug in a particular
measurement it’s just a function with respect to the parameters.
That means likelihood terms _don’t_ get Jacobians.


Right, the data is fixed; there are no neighborhoods involved. Likelihoods are actually weird objects, because we talk about them like they're probabilities, but before you see the data, the probability that you'll get that exact data is 0 (assuming continuous distributions).

Usually this is pretty clear, because by definition likelihoods define
how the data are sampled and we’d always write

data ~ distribution_name(theta, hyperparameters).

Because theta is on the right-hand side writing

data ~ distribution_name(f(theta), hyperparameters)

wouldn’t throw a Jacobian warning.  Similarly, because data don’t
trigger Jacobian warnings

f(data) ~ distribution_name(theta, hyperparameters)

would also be fine.

What’s happening here is that Daniel Lakeland is abusing the
symmetry of the Gaussian density to write

data ~ normal(f(theta), hyperparameters)

as

f(theta) ~ normal(data, hyperparameters)

The latter doesn’t really make sense and abuses the intended
meaning of the ~ notation (which ultimately is allowed only because
the ~ notation is somewhat ill-defined in the first place).  For any
other distribution this switch wouldn’t make sense and not needing
the Jacobian would be obvious.


Note, I wasn't TRYING to abuse the notation; I simply had a different view of what that notation means than you guys, and that's not too surprising given that ~ is a pretty vague concept in the literature.

My thought was: "if I correct this data in a certain way, it will be equal to the expression for the error term, and then I want that corrected data to be a normal(0,sigma) random variable"

That is, I was doing a nonlinear regression by saying in my head:  y - myfunction(x,a,b) = epsilon

and was shocked that Stan would be telling me to change my model. And then, when I thought about it, the request for a Jacobian wasn't even a well-defined request.

Also note, I got the idea that this was a reasonable way to write Stan code (with regression predictions on the left side of the ~) from some examples that Andrew Gelman wrote on his blog!!!


Michael Betancourt

Aug 16, 2015, 11:22:58 PM
to stan-...@googlegroups.com
I haven’t found a book that covers the necessary theory
without losing the context of application and practice.
If anyone finds me a job that lasts for more than a few
years then a probability book would be the next big project
(along with a comprehensive HMC book)...

Bob Carpenter

Aug 16, 2015, 11:58:26 PM
to stan-...@googlegroups.com

> On Aug 16, 2015, at 11:19 PM, Daniel Lakeland <dlak...@street-artists.org> wrote:
>
>
>
>
> But that would be too easy in of itself. Because in Bayes Theorem
> we specify a posterior with two terms — the prior and the likelihood.
> The prior behaves is a density and so behaves as above, but the
> likelihood is a different object. The likelihood is a conditional
> probability distribution over data and when we plug in a particular
> measurement it’s just a function with respect to the parameters.
> That means likelihood terms _don’t_ get Jacobians.
>
>
> Right, the data is fixed, there are no neighborhoods involved. Likelihoods are actually weird objects, because we talk about them like they're probabilities,

It's a common misconception, but nobody should be talking about densities as
if they're probabilities --- they're not even constrained to be less than 1!
Probability theory and math stats books are very careful to distinguish
event probabilities from densities. There's even a different notation with a
capital "P" (or "Pr") for event probabilities.

> but before you see the data, the probability that you'll get that exact data is 0 (assuming continuous distributions).

The trick to doing stats is that even after we see the data, we treat it
as if it might have been different.

...

> I simply had a different view of what that notation means than you guys and that's not too surprising given that ~ is a pretty vague concept in the literature.

It's very precisely defined in Stan. When you write

y ~ foo(theta);

what happens is that it's equivalent to writing

increment_log_prob(foo_log(y, theta));

with the exception that the ~ notation drops constant terms.

Writing

y ~ normal(mu, sigma);

where y is data and mu and sigma are parameters produces the exact
same effect as:

increment_log_prob(-0.5 * ((y - mu) / sigma)^2 - log(sigma))

where we drop that pesky sqrt(2 * pi) term. If I write

beta ~ normal(0, 1 / sqrt(2));

then I get the exact same behavior as writing

increment_log_prob(-beta^2);

Stan only cares about the density defined on the parameters up to
a proportion. Everything else follows from that.

> My thought was: "if I correct this data in a certain way, it will be equal to the expression for the error term, and then I want that corrected data to be a normal(0,sigma) random variable"
>
> That is, I was doing a nonlinear regression by saying in my head: y - myfunction(x,a,b) = epsilon
>
> and was shocked that Stan would be telling me to change my model, and then when I thought about it, the request for a jacobian wasn't even a well defined request.
>
> Also note, I got the idea that this was a reasonable way to write Stan code (with regression predictions on the left side of the ~) from some examples that Andrew Gelman wrote on his blog!!!

I'll insert an editorial note into the blog post if you remember where
it is.

I do love Andrew and Jennifer's book in part because they provide great
translation keys.

- Bob

Michael Betancourt

Aug 17, 2015, 12:52:33 AM
to stan-...@googlegroups.com

My thought was: "if I correct this data in a certain way, it will be equal to the expression for the error term, and then I want that corrected data to be a normal(0,sigma) random variable"

That is, I was doing a nonlinear regression by saying in my head:  y - myfunction(x,a,b) = epsilon

and was shocked that Stan would be telling me to change my model, and then when I thought about it, the request for a jacobian wasn't even a well defined request.

But that’s not _generative_.  The generative picture is

1) generate truth, myfunction(x,a,b)
2) generate noise, epsilon ~ normal(0, sigma)
3) generate measurement, y = myfunction(x,a,b) + epsilon

Hence

y = myfunction(x,a,b) + epsilon
y ~ normal(myfunction(x,a,b), sigma)

On the other hand,

myfunction(x,a,b) ~ normal(y, sigma)

is not generative!  You can write it down, but it’s certainly
not a likelihood contribution in general and potentially
dangerous, hence the warning.

Avraham Adler

Aug 17, 2015, 2:27:25 PM
to Stan users mailing list
Maybe we can get Michael to compromise, tie him to a chair at the next Stan meetup he attends, and force him to record a few videos on the topic?

In that vein, speaking as a statistical practitioner, not theoretician, I very much appreciate the clarity with which Drs. Gelman, Carpenter, Betancourt, Goodrich, etc. answer these questions!

Avi

Daniel Lakeland

Aug 18, 2015, 12:58:00 PM
to Stan users mailing list


On Sunday, August 16, 2015 at 8:58:26 PM UTC-7, Bob Carpenter wrote:

> On Aug 16, 2015, at 11:19 PM, Daniel Lakeland <dlak...@street-artists.org> wrote:
>
>
>
>  
> But that would be too easy in of itself.  Because in Bayes Theorem
> we specify a posterior with two terms — the prior and the likelihood.
> The prior behaves is a density and so behaves as above, but the
> likelihood is a different object.  The likelihood is a conditional
> probability distribution over data and when we plug in a particular
> measurement it’s just a function with respect to the parameters.
> That means likelihood terms _don’t_ get Jacobians.
>
>
> Right, the data is fixed, there are no neighborhoods involved. Likelihoods are actually weird objects, because we talk about them like they're probabilities,

It's a common misconception, but nobody should be talking about densities as
if they're probabilities --- they're not even constrained to be less than 1!
Probability theory and math stats books are very careful to distinguish
event probabilities from densities.  There's even a different notation with a
capital "P" (or "Pr") for event probabilities.  

I wasn't referring to the difference between a density and a probability so much, more that even if you multiply by a fixed differential of the outcome space to get a probability, you evaluate this AT the points where the data fell, and these points never change. After plugging in the data, the whole thing becomes a function only of the parameters!

Now that we've been through this discussion, I see that the thing that was so confusing to me about Stan's warning was that I was thinking of my line of code as providing likelihood information, but Stan was telling me to alter my likelihood, in other words, alter my model of the world! But I realize now that it was because I was using the Stan notation in a way Stan doesn't really expect.

Before plugging in the data values, you can think of the function as a probability distribution over the data, but it doesn't represent the probability of the actual data obviously, it represents your knowledge of what the data would be conditional only on the parameter values and the structure of your model. Thinking of this as the data "could have been different" is sort of a fundamentally frequentist way of thinking about the data. The sort of philosophically Bayesian viewpoint is that before seeing the data the likelihood can be thought of as a probability distribution describing your information about where the data will lie, conditional on the parameter value. After seeing the data, it's a sort of anchor or tilt on the probability of the parameters since it's purely a function of the parameters.


> I simply had a different view of what that notation means than you guys and that's not too surprising given that ~ is a pretty vague concept in the literature.

It's very precisely defined in Stan.  When you write

  y ~ foo(theta);

what happens is that it's equivalent to writing

  increment_log_prob(foo_log(y, theta));

with the exception that the ~ notation drops constant terms.


Yes, so this makes my model mathematically correct (I would get the mathematical results I wanted). But it seems clear from Michael's posts and the warning messages that the ~ notation is intended to be thought of in a "generative modeling" manner, and that if you use this notation in another way you may get warnings which don't make sense, and you may confuse other people when they go to read your Stan code. So from now on, I'll put my data on the left and functions of parameters and data on the right. Note that when the density isn't symmetric in its first two arguments you HAVE to do that anyway; otherwise you probably won't get the results you want.

My comments above about nonlinear transforms of the parameters should be taken in that light: it's perfectly fine to take some parameters, put them through some nonlinear function, and use that as predictions of the data, without somehow "correcting for" the nonlinearity in the transform, because in that process you're constructing a likelihood; you're describing what you think the "truth" is, in Michael's terminology. But if you write that as:

foofunction(data,params) ~ distribution(data,other,parameters)

even if this is mathematically equivalent to:

data ~ distribution(foofunction(data,params),other,params)

you will get Stan warnings about the nonlinear foofunction because Stan doesn't expect you to use ~ in that way. Best to just use the software in the way it was intended I think.



> Also note, I got the idea that this was a reasonable way to write Stan code (with regression predictions on the left side of the ~) from some examples that Andrew Gelman wrote on his blog!!!

I'll insert an editorial note into the blog post if you remember where it is.  

I don't remember, but I think it was one of those "how tall is...." posts. You may have even had some editorial remarks already now that I think about it.


Daniel Lakeland

Aug 18, 2015, 1:52:24 PM
to Stan users mailing list



On Sunday, August 16, 2015 at 8:58:26 PM UTC-7, Bob Carpenter wrote:


I'll insert an editorial note into the blog post if you remember where
it is.  

I do love Andrew and Jennifer's book in part because they provide great
translation keys.

- Bob

http://andrewgelman.com/2014/09/29/general-principles-bayesian-data-analysis-arising-stan-analysis-john-lee-andersons-height/

Yes, you did in fact make some mention of the need for Jacobians, but not of defining likelihoods this way, and of this violating the "generative" assumption about what Stan thinks ~ means.

In that example, Andrew defines his likelihood in the form

linear_function_of_priors ~ normal(constant_value_that_is_actually_observed_data, constant_standard_deviation)

talk about terrible programming practices! :-)


 

Bob Carpenter

Aug 18, 2015, 2:13:55 PM
to stan-...@googlegroups.com
The notation's a little vague, but I don't think Andrew would think of
the shoe sizes as priors. They're an unknown for sure. But then so is
a prediction for a new data point. "Prior" is usually reserved for
priors on parameters. Stan uses "parameters" for any unobserved variable,
which is admittedly confusing.

- Bob

Daniel Lakeland

Aug 18, 2015, 2:35:03 PM
to Stan users mailing list


The notation's a little vague, but I don't think Andrew would think of
the shoe sizes as priors.  They're an unknown for sure.  But then so is
a prediction for a new data point.  "Prior" is usually reserved for
priors on parameters.  Stan uses "parameters" for any unobserved variable,
which is admittedly confusing.  

- Bob

Whoops, sorry, that's just a word-o. I meant, and should have said, linear_function_of_parameters, not _of_priors.