Constraint violations and divergent transitions


Eric Buhle

Jun 13, 2016, 3:46:08 PM
to Stan users mailing list
Hi all,

I'm working on a state-space fisheries model in which the catch (declared as data) is subtracted from the population size ("recruits", declared as a parameter), and the difference ("spawners", a transformed parameter) is required to be positive. As a cartoon, I have something like this:

data {
  vector<lower=0>[N] C;
}

parameters {
  vector<lower=0>[N] R;
}

transformed parameters {
  vector<lower=0>[N] S;

  S <- R - C;
}
. . .

Of course my actual model is much more complicated, with age structure, lots of hierarchical elements, etc., but I'm hoping for some fairly generic answers and guidance here. I'll point out that this basic setup applies to a wide range of mass-balance type problems, including pretty much any fisheries model.

What happens, of course, is that Stan rejects any draw that violates the constraint on S and spits out an "Informational Message" to that effect (I'm using rstan 2.9.0 in RStudio, by the way). The smaller the difference between one or more corresponding elements of R and C, the smaller the feasible region of parameter space, hence the more frequently trajectories fall into the "hole". In my case, I get a ton of these rejections. I also get a very large number of post-warmup divergent transitions, which is really what concerns me. I'm almost certain the divergences are related to the constraint violations, because they disappear if I tweak the model to avoid taking the difference of R and C:

data {
  vector<lower=0,upper=1>[N] h;  // harvest rate estimated from observed data
}

parameters {
  vector<lower=0>[N] R;
}

transformed parameters {
  vector<lower=0>[N] S;

  S <- R*(1 - h); // ensures S > 0, but not a satisfactory solution because in reality C, not h, is known
}
. . .

I've read several previous threads that have touched on related issues (e.g., this one), and the discussion has been very helpful. What I've gleaned is that it really isn't a good idea to enforce constraints on transformed parameters in this way, by allowing Metropolis rejections when the constraints aren't met. Instead, I gather that I ought to reparameterize so that the constrained difference (S) is the primitive parameter, thereby making the parameter space a dense set. I would then work backward to R and include the Jacobian of that transformation when I place a process model likelihood on R in the model block. In the toy example above, this would be trivial and would probably work fine. In my actual model it's much less trivial, but I've tried to implement it. Unfortunately, the reparameterized version has some pathology I've been unable to root out; sampling is extremely slow and the chains barely move.
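In the toy example, that reparameterization would look something like this (a sketch, glossing over the real model's age structure). One nice feature of the cartoon version: because R is just S shifted by the known constant C, the Jacobian of the transform is 1, so no correction term is actually needed there:

data {
  vector<lower=0>[N] C;
}

parameters {
  vector<lower=0>[N] S;  // spawners as the primitive parameter; positivity is free
}

transformed parameters {
  vector<lower=0>[N] R;
  R <- S + C;  // shift by known catch; d(R)/d(S) = 1, so no Jacobian adjustment
}

model {
  // process likelihood on R goes here; since the transform is a constant
  // shift, no increment_log_prob() Jacobian term is required in this case
}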

I'm left wondering whether the original model, Informational Messages and divergences and all, is really that bad. The results otherwise look fine; Rhats and traceplots show no problems, and the estimates make sense. Also, I can reduce (but not eliminate) the divergences by increasing adapt_delta to, say, 0.99, and when I do this I cannot see any change in the posterior summary statistics above MC error. This makes me think that any bias caused by the divergences must be pretty minor.

So at this point I have a few specific/algorithmic questions and a more general one:

(1) I'm trying to understand the relationship between constraint violations and divergent transitions. Partly this reflects my fuzziness on what exactly causes divergences at a computational level, but suppose that somewhere along the NUTS tree, the integrator encounters a divergence. Which point gets retained for that iteration of the HMC chain? The last point (i.e., the chain doesn't move)? Or a slice sample from the points along the path before the divergence occurred? Or...? Ditto for the case when one of the points on the tree violates parameter constraints -- which point gets retained then?

(2) Does a constraint violation necessarily (or usually) produce a divergence? If I know the divergences are just telling me about the infeasible holes in parameter space, should I feel reassured about treating them as innocuous?

(3) Do you have any general guidance or best practices for situations like this, where reparameterizing to a dense set is either analytically cumbersome or computationally intractable?

Thanks for reading and for fostering such a helpful forum for discussion and feedback.


Eric Buhle

Ben Goodrich

Jun 13, 2016, 4:11:19 PM
to Stan users mailing list
On Monday, June 13, 2016 at 3:46:08 PM UTC-4, Eric Buhle wrote:
So at this point I have a few specific/algorithmic questions and a more general one:

(1) I'm trying to understand the relationship between constraint violations and divergent transitions. Partly this reflects my fuzziness on what exactly causes divergences at a computational level, but suppose that somewhere along the NUTS tree, the integrator encounters a divergence. Which point gets retained for that iteration of the HMC chain? The last point (i.e., the chain doesn't move)? Or a slice sample from the points along the path before the divergence occurred? Or...? Ditto for the case when one of the points on the tree violates parameter constraints -- which point gets retained then?

A divergence is caused by the inability to numerically solve with sufficient accuracy the Hamiltonian differential equations for the next position. I believe in 2.9.0, a divergence causes the leapfrogging to stop, and it slices from the pre-divergent leapfrog steps. A constraint violation is different and causes the previous point to be retained.
 
(2) Does a constraint violation necessarily (or usually) produce a divergence? If I know the divergences are just telling me about the infeasible holes in parameter space, should I feel reassured about treating them as innocuous?

No. A divergence happens while the constraints are being satisfied, albeit with a transformation from the constrained space to an unconstrained space.
 
(3) Do you have any general guidance or best practices for situations like this, where reparameterizing to a dense set is either analytically cumbersome or computationally intractable?

You pretty much have to write a cumbersome Stan program. Can you make h be a vector of primitive parameters between 0 and 1 and define R as a transformed parameter that is equal to C ./ h?
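Something like this sketch (with S recovered for convenience; not your actual model, obviously):

data {
  vector<lower=0>[N] C;
}

parameters {
  vector<lower=0,upper=1>[N] h;  // harvest rate as the primitive parameter
}

transformed parameters {
  vector<lower=0>[N] R;
  vector<lower=0>[N] S;
  R <- C ./ h;  // recruits implied by known catch and harvest rate
  S <- R - C;   // equals C .* (1 - h) ./ h, positive by construction
}

model {
  // if you place a prior or process likelihood directly on R, you need the
  // Jacobian of the h -> R transform: |dR/dh| = C ./ square(h), i.e.
  // increment_log_prob(sum(log(C) - 2 * log(h)));
}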

Ben

Eric Buhle

Jun 13, 2016, 4:53:18 PM
to Stan users mailing list
Hi Ben,

Thanks for the clarifications.


On Monday, June 13, 2016 at 1:11:19 PM UTC-7, Ben Goodrich wrote:
(2) Does a constraint violation necessarily (or usually) produce a divergence? If I know the divergences are just telling me about the infeasible holes in parameter space, should I feel reassured about treating them as innocuous?
 

No. A divergence happens while the constraints are being satisfied, albeit with a transformation from the constrained space to an unconstrained space.

Interesting. This leaves me wondering why the divergences disappeared in my second cartoon example. I realize the posterior geometry in the two versions will differ in ways apart from the infeasible "holes", but I wouldn't have thought it would differ so much. Could it be that part of the problem with the "holes" is that they hinder adaptation and lead to a sub-optimal step size and mass matrix, and that leads to lots of divergences?
 
 
(3) Do you have any general guidance or best practices for situations like this, where reparameterizing to a dense set is either analytically cumbersome or computationally intractable?

You pretty much have to write a cumbersome Stan program. Can you make h be a vector of primitive parameters between 0 and 1 and define R as a transformed parameter that is equal to C ./ h?

Yeah, I should have said "more cumbersome". This is a clever idea, though, and it might work. Things are more complicated than in my toy example because R and S are actually year x age arrays, but I think the same logic applies. I've actually spent the last few days trying something similar but not quite the same: declare both R and h as primitive parameters and estimate h by matching C (which in this case is essentially known without error) as closely as possible. To do this, you have to define a likelihood with a very high precision, say

C ~ lognormal(log(R .* h), 0.001)

This works and gives sensible answers, but it is excruciatingly slow (compared to the original version) and still has some divergences. I'll think about your idea of deterministically solving for R and see if I can get it to work.

I'd still be interested in hearing from anyone else who's grappled with the same basic issue and found solutions that work (or don't).

Thanks,
Eric

Ben Goodrich

Jun 13, 2016, 5:01:01 PM
to Stan users mailing list
On Monday, June 13, 2016 at 4:53:18 PM UTC-4, Eric Buhle wrote:
Hi Ben,

Thanks for the clarifications.

On Monday, June 13, 2016 at 1:11:19 PM UTC-7, Ben Goodrich wrote:
(2) Does a constraint violation necessarily (or usually) produce a divergence? If I know the divergences are just telling me about the infeasible holes in parameter space, should I feel reassured about treating them as innocuous?
 

No. A divergence happens while the constraints are being satisfied, albeit with a transformation from the constrained space to an unconstrained space.

Interesting. This leaves me wondering why the divergences disappeared in my second cartoon example. I realize the posterior geometry in the two versions will differ in ways apart from the infeasible "holes", but I wouldn't have thought it would differ so much. Could it be that part of the problem with the "holes" is that they hinder adaptation and lead to a sub-optimal step size and mass matrix, and that leads to lots of divergences?

Yes.

Ben

Ben Goodrich

Jun 13, 2016, 5:05:46 PM
to Stan users mailing list
On Monday, June 13, 2016 at 4:53:18 PM UTC-4, Eric Buhle wrote:
Yeah, I should have said "more cumbersome". This is a clever idea, though, and it might work. Things are more complicated than in my toy example because R and S are actually year x age arrays, but I think the same logic applies. I've actually spent the last few days trying something similar but not quite the same: declare both R and h as primitive parameters and estimate h by matching C (which in this case is essentially known without error) as closely as possible. To do this, you have to define a likelihood with a very high precision, say

C ~ lognormal(log(R .* h), 0.001)

This works and gives sensible answers, but it is excruciatingly slow (compared to the original version) and still has some divergences. I'll think about your idea of deterministically solving for R and see if I can get it to work.

This is not great because, locally at least, you can increase an element of R by a small amount and decrease an element of h by an offsetting amount without changing the likelihood of the corresponding element of C. I think you need R to be conditionally deterministic given h.

Ben




Eric Buhle

Jun 13, 2016, 5:20:15 PM
to Stan users mailing list

Ah, good point. I hadn't looked at the posterior correlation between R and h, just the marginal of h (which is very tight, as you'd expect).

After reflecting on your suggestion a little more, I think it's similar to but not quite the same as the one I sketched in the OP ("reparameterize to a dense set"). I've gone about that by declaring S as a primitive parameter and solving for R, essentially as R <- S + C. (It's uglier than that because, again, age structure.) That's the version that takes forever and doesn't even begin to mix at all. Maybe the geometry would be better-behaved if I did what you suggest, declaring h (and the age structure matrix p_age) as parameters and computing R[i,a] <- p_age[i,a] * C[i] / h[i]. I'll give it a try.

Thanks again,
Eric

Michael Betancourt

Jun 13, 2016, 6:23:52 PM
to stan-...@googlegroups.com
Deep down in the code, constraint violations actually do trigger divergences.
What happens when a constraint is violated is that an exception is thrown
and caught by the MCMC code, which sets the posterior density to zero
(log density to negative infinity, potential to infinity), which the sampler then interprets as a divergence
(technically this _is_ a divergence, albeit an artificial one given the imposed
constraint).  We could potentially consider changing this behavior, although
it would require some changes to the architecture of the MCMC code.

But there are deeper problems here.  In particular, I can guarantee you that
you do not know C as well as you claim to, because if you did then R would be
guaranteed to be larger than C and none of this would be a problem!  What you
want to do is think _generatively_, with C a measurement of some background
rate, which itself is a parameter just the same as the rate for the total observation.  For 
example, a very simple model would be

data {
  int<lower=0> C_meas;
  int<lower=0> N;
}

parameters {
  real<lower=0> lambda_S;
  real<lower=0> lambda_C;
}

model {
  // priors
  …
  C_meas ~ poisson(lambda_C);
  N ~ poisson(lambda_C + lambda_S);
  // … etc
}

_Anytime_ you’re subtracting a “background” you’re asking for trouble 
and need to flip your model around to be generative to avoid these kinds
of issues.  Even when you have more complex covariate structure — it
will always be easier to implement generatively.


Eric Buhle

Jun 13, 2016, 7:39:51 PM
to Stan users mailing list
Hi Michael,

Thanks; that goes a long way toward clearing up my understanding of divergences vis-à-vis constraint violations.

I do effectively know C with certainty in this case, although that may not be true in more general situations. I can't see a way to incorporate that knowledge into bounds on declared parameters, however, because (1) while I've been trying to spare everyone the tedious arithmetic, what I've been calling R is actually a function of two lower-level parameters (let's call them R_tot, a vector, and p_age, an array of simplexes), so the constraint amounts to the solution set of a system of linear inequalities which make my head hurt, (2) even if I could work out the solution, it would give me constraints on one primitive parameter that depend on another primitive parameter, and (3) even in the trivial case of my toy example, you would have different lower bounds on different elements of R, which Stan doesn't support.

In the more general case, you're right that it would make sense to assign C_meas a likelihood component and estimate its observation error variance instead of fixing it like I described above. I haven't tried that, mainly because then the model wouldn't reflect what I know about the problem.

Thanks,
Eric

Bob Carpenter

Jun 14, 2016, 1:36:04 AM
to stan-...@googlegroups.com
To get around (2) and (3), it's possible to declare a vector
of unconstrained parameters and do the transforms yourself
to a constrained vector. These constraints can depend on
each other (as in a simplex) as long as you can compute the
transform and its Jacobian. That can be a pain, though, even
when it is possible.

If you're dealing with an underlying simplex,
you might want to check if some of the components are collapsing
to zero and you definitely should check that the result is
identified; if there are non-identifiabilities, Stan can try
really hard to explore a very large space of possibilities,
sometimes triggering numerical issues.

- Bob

Michael Betancourt

Jun 14, 2016, 4:46:14 AM
to stan-...@googlegroups.com
I do effectively know C with certainty in this case, although that may not be true in more general situations. I can't see a way to incorporate that knowledge into bounds on declared parameters, however, because (1) while I've been trying to spare everyone the tedious arithmetic, what I've been calling R is actually a function of two lower-level parameters (let's call them R_tot, a vector, and p_age, an array of simplexes), so the constraint amounts to the solution set of a system of linear inequalities which make my head hurt, (2) even if I could work out the solution, it would give me constraints on one primitive parameter that depend on another primitive parameter, and (3) even in the trivial case of my toy example, you would have different lower bounds on different elements of R, which Stan doesn't support.

I’m not sure we’re on the same page.  All of those constraints and linear inequalities?
Those are clear indicators that you’re trying to invert some previously measured data
to some deterministic determination of C.  This is one of the most common mistakes
scientists make when trying to build probabilistic models and it leads to no end of
common problems.  If you build up everything generatively then any constraints will
be straightforward to implement and the complexly structured data modeled implicitly
by the forward model that combines all of those simple constraints together.

Again, if you knew C exactly then R would always be greater than C and the naive
subtraction would work without any issues whatsoever.  The fact that this constraint
doesn’t always hold indicates that you don’t know C as exactly as you claim (most
likely your system for C is ill-posed and there isn’t even a unique value in the case
of infinite data).  

In the more general case, you're right that it would make sense to assign C_meas a likelihood component and estimate its observation error variance instead of fixing it like I described above. I haven't tried that, mainly because then the model wouldn't reflect what I know about the problem.

That was just a simple example of a forward model to replace naive “background
subtraction”.  In your case you wouldn’t want to follow this exactly but rather replace
the likelihood for C_meas with your complex generative model.

Eric Buhle

Jun 14, 2016, 12:03:43 PM
to Stan users mailing list
Hi Bob,

Thanks for the input.


On Monday, June 13, 2016 at 10:36:04 PM UTC-7, Bob Carpenter wrote:
To get around (2) and (3), it's possible to declare a vector
of unconstrained parameters and do the transforms yourself
to a constrained vector.  These constraints can depend on
each other (as in a simplex) as long as you can compute the
transform and its Jacobian.  That can be a pain, though, even
when it is possible.

Is there an example of this somewhere? I think this is basically what I did when I declared S (here a matrix) as a parameter and transformed to R_tot and p_age. The latter two transformed parameters get process model likelihoods, and I included the Jacobian of the inverse transformation. (I know I didn't explain this version very clearly, but it's the one I mentioned in the OP that takes forever and doesn't mix.) In that case, S is not actually unconstrained but rather has a lower bound of zero, but it seems like the principle is the same. But maybe you meant something different?

If you're dealing with an underlying simplex,
you might want to check if some of the components are collapsing
to zero and you definitely should check that the result is
identified;  if there are non-identifiabilities, Stan can try
really hard to explore a very large space of possibilities,
sometimes triggering numerical issues.

Thanks for pointing this out. I don't think any of the components are slammed up against zero, but I'll double-check.

-Eric

Eric Buhle

Jun 14, 2016, 12:31:13 PM
to Stan users mailing list
Hi Michael,

It's possible I'm being dense, in which case thanks for bearing with me. But the idea of posing a generative model makes perfect sense to me iff C is measured with error. In that case, I could assign it a stochastic observation model and estimate any associated parameters (e.g., observation error variance). In my case, C is known so precisely (in an absolute sense and relative to other observed quantities) that it doesn't really make sense to treat it as uncertain. Given that, I can only think of the following alternatives to the naive subtraction:

(1) Parameterize with S (defined as R - C, which must be positive) and solve deterministically for R as R <- S + C. I tried this, so far without success. Again, I'm glossing over a lot of complexity in this cartoon version.
(2) Parameterize with h (defined as C ./ R, bounded on [0,1]) and solve deterministically for R as R <- C ./ h and for S as S <- R .* (1 - h). This was Ben's suggestion, and I'm still wrapping my head around it, but it seems promising.
(3) Adopt the generative approach, but only as a kludge to force "C_hat" to be very close to C, essentially putting a point mass prior on the observation error variance. I've been playing around with this, but Ben pointed out it wasn't a great idea.
(4) Somehow incorporate C into the declared bounds on R (or, in my actual case, the primitive parameters R_tot and p_age). I can't see how this would be done, but maybe I'm missing something.

Again, thanks for the input. Hope that clarifies things, and please let me know if I'm still missing something glaringly obvious.

-Eric

Michael Betancourt

Jun 14, 2016, 1:20:02 PM
to stan-...@googlegroups.com
It's possible I'm being dense, in which case thanks for bearing with me. But the idea of posing a generative model makes perfect sense to me iff C is measured with error. In that case, I could assign it a stochastic observation model and estimate any associated parameters (e.g., observation error variance). In my case, C is known so precisely (in an absolute sense and relative to other observed quantities) that it doesn't really make sense to treat it as uncertain. Given that, I can only think of the following alternatives to the naive subtraction:

You say this, but then why were you having constraint problems
in the beginning?  

I agree that if C is known exactly, or close enough to exactly, then it 
wouldn’t have to be modeled.  And in that case you could write a model 
for the counts above the baseline, S = R - C, knowing that S is always 
positive.  I am being insistent on a generative model only because you
had originally said that you were having problems keeping S positive!
If I am incorrect then please correct me.

I see this all of the time in physics, where background subtraction is the
norm, and the only way to model everything correctly is to actually
model the background subtraction process explicitly.

(1) Parameterize with S (defined as R - C, which must be positive) and solve deterministically for R as R <- S + C. I tried this, so far without success. Again, I'm glossing over a lot of complexity in this cartoon version.
(2) Parameterize with h (defined as C ./ R, bounded on [0,1]) and solve deterministically for R as R <- C ./ h and for S as S <- R .* (1 - h). This was Ben's suggestion, and I'm still wrapping my head around it, but it seems promising.
(3) Adopt the generative approach, but only as a kludge to force "C_hat" to be very close to C, essentially putting a point mass prior on the observation error variance. I've been playing around with this, but Ben pointed out it wasn't a great idea.

Actually these are all generative models, just with different assumptions
about the measurement process.  (1) assumes that the signal and
background (or baseline or whatever) are additive, which is consistent
with most data generating processes.  (2) assumes that the signal is
some fixed proportion of the background, which is a limiting case of
(1) as the counts increase towards infinity.  In particular, (1) and (2)
fall under (3) if you treat everything as data to be modeled.



Eric Buhle

Jun 14, 2016, 1:47:04 PM
to Stan users mailing list
On Tuesday, June 14, 2016 at 10:20:02 AM UTC-7, Michael Betancourt wrote:
It's possible I'm being dense, in which case thanks for bearing with me. But the idea of posing a generative model makes perfect sense to me iff C is measured with error. In that case, I could assign it a stochastic observation model and estimate any associated parameters (e.g., observation error variance). In my case, C is known so precisely (in an absolute sense and relative to other observed quantities) that it doesn't really make sense to treat it as uncertain. Given that, I can only think of the following alternatives to the naive subtraction:

You say this, but then why were you having constraint problems
in the beginning?  

I was having constraint problems because I can't see a way to turn my knowledge of the background/baseline (C) into coherent constraints on the declared, primitive parameters. This would be a problem even in the first toy example in the OP, because different elements of C imply different lower bounds on elements of R. My understanding is that Stan doesn't support this, unless the approach that Bob alluded to is something different than what I'm thinking. In my actual model things are even worse, because the R that needs to be constrained is a function of two lower-level parameters (R_tot and p_age), so even if I could work out the constraints on the latter two algebraically, they would be interdependent.
 
I agree that if C is known exactly, or close enough to exactly, then it 
wouldn’t have to be modeled.  And in that case you could write a model 
for the counts above the baseline, S = R - C, knowing that S is always 
positive.  I am being insistent on a generative model only because you
had originally said that you were having problems keeping S positive!
If I am incorrect then please correct me.

Nope, you've got it. And this is what I tried to do, as described in the OP ("reparameterize to a dense set"), but I can't even get this version off the ground, sampling-wise. Maybe I made an error in the code, but I've spent a lot of time going over it, starting simple and building up complexity, etc. I suspect it's just that the posterior geometry is worse than in the original version, apart from the holes induced by the constraint violations.
 
I see this all of the time in physics where background subtraction is the
norm, and the only way to do model everything correctly is to actually
model the background subtraction process explicitly.

Interesting. I know pretty much nothing about statistical physics, but this is definitely the norm in fisheries models. Sometimes you'll see variants of my approach (3), where an arbitrarily steep penalty is applied to keep the likelihood surface continuous.
 

(1) Parameterize with S (defined as R - C, which must be positive) and solve deterministically for R as R <- S + C. I tried this, so far without success. Again, I'm glossing over a lot of complexity in this cartoon version.
(2) Parameterize with h (defined as C ./ R, bounded on [0,1]) and solve deterministically for R as R <- C ./ h and for S as S <- R .* (1 - h). This was Ben's suggestion, and I'm still wrapping my head around it, but it seems promising.
(3) Adopt the generative approach, but only as a kludge to force "C_hat" to be very close to C, essentially putting a point mass prior on the observation error variance. I've been playing around with this, but Ben pointed out it wasn't a great idea.

Actually these are all generative models, just with different assumptions
about the measurement process.  

Okay, got it. I think part of the confusion was just the semantics of what counts as a "generative" or "forward" model.

-Eric

Eric Buhle

Jun 14, 2016, 2:03:13 PM
to Stan users mailing list
Hmm, it just occurred to me that it might help clear some things up if I explain the rationale for parameterizing with R instead of S in the first place. Essentially it's because the population-dynamics process model describes how R at each time step is generated from S and p_age at previous time steps, so treating R as a parameter is natural in that it allows you to place the process likelihood on a primitive parameter. If you parameterize with S and solve for R and p_age, you have to include the Jacobian correction in the likelihood statement. And, as I suggested above, I suspect doing it this way makes the posterior geometry uglier in general. 

-Eric

Michael Betancourt

Jun 14, 2016, 2:48:33 PM
to stan-...@googlegroups.com
We can’t do anything but speculate without seeing the entire model, and given
what you imply about the model we won’t have time to look at it in the necessary
detail.

But I’ll just add a few comments.

1) There’s a difference between expected background and actual background —
even if you know the expected background exactly, you can get substantial variation
from measurement to measurement.  In other words, there’s a huge difference
between

R_meas ~ Poisson(C + S)

and

C_meas ~ Poisson(C)
S_meas ~ Poisson(S)
R_meas = C_meas + S_meas.

Conflicts such as R_meas - C > S_meas almost always come from assuming
the former when it’s really the latter that is appropriate.
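Written out as a complete toy Stan program, the second version looks something like this sketch (flat priors implied; lambda_C could get a tight prior if the expected background is well characterized):

data {
  int<lower=0> C_meas;  // measured background count
  int<lower=0> S_meas;  // measured signal count
}

parameters {
  real<lower=0> lambda_C;  // background rate
  real<lower=0> lambda_S;  // signal rate
}

model {
  C_meas ~ poisson(lambda_C);
  S_meas ~ poisson(lambda_S);
  // R_meas = C_meas + S_meas is determined by the data, so subtracting a
  // "known" background never has to produce a negative count
}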

2) All of this carries over to time-series or space-time models.  In fact a state-space
perspective is very helpful here — the model is of the underlying rates, both for
signal and background, and the measurement model is how those rates manifest
as an explicit measurement.  The second model above is a very simple example
of this idea, and you can imagine extending it where C and S are time series (or
C is a constant assumed known and S is a time series).

In the end it’s hard to beat the Folk Theorem — if you’re having computational
problems then you’re not really fitting the right model.

Bob Carpenter

Jun 14, 2016, 3:04:47 PM
to stan-...@googlegroups.com

> On Jun 14, 2016, at 1:47 PM, Eric Buhle <eric....@noaa.gov> wrote:
>
> On Tuesday, June 14, 2016 at 10:20:02 AM UTC-7, Michael Betancourt wrote:
>> It's possible I'm being dense, in which case thanks for bearing with me. But the idea of posing a generative model makes perfect sense to me iff C is measured with error. In that case, I could assign it a stochastic observation model and estimate any associated parameters (e.g., observation error variance). In my case, C is known so precisely (in an absolute sense and relative to other observed quantities) that it doesn't really make sense to treat it as uncertain. Given that, I can only think of the following alternatives to the naive subtraction:
>
> You say this, but then why were you having constraint problems
> in the beginning?
>
> I was having constraint problems because I can't see a way to turn my knowledge of the background/baseline (C) into coherent constraints on the declared, primitive parameters. This would be a problem even in the first toy example in the OP, because different elements of C imply different lower bounds on elements of R. My understanding is that Stan doesn't support this, unless the approach that Bob alluded to is something different than what I'm thinking.

Stan supports varying lower bounds as follows:

data {
  vector[K] theta_lb;
  ...
}

parameters {
  vector[K] theta_raw;
  ...
}

transformed parameters {
  vector[K] theta;
  theta <- exp(theta_raw) + theta_lb;
  increment_log_prob(sum(theta_raw)); // log Jacobian of theta_raw -> theta
}
Now you have theta with varying lower bounds.
If you're just going to use theta rather than put a prior
on it directly, you won't need the Jacobian term.

- Bob



> In my actual model things are even worse, because the R that needs to be constrained is a function of two lower-level parameters (R_tot and p_age), so even if I could work out the constraints on the latter two algebraically, they would be interdependent.

That's OK, too. They just go into the constraint and you
need the Jacobian.
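
For example, with lower bounds that depend on earlier components, the same trick chains (this is just the positive-ordered construction, as a sketch, not Eric's actual model):

parameters {
  vector[K] theta_raw;  // unconstrained
}

transformed parameters {
  vector[K] theta;
  theta[1] <- exp(theta_raw[1]);
  for (k in 2:K)
    theta[k] <- theta[k-1] + exp(theta_raw[k]);  // bound depends on previous component
}

model {
  // the transform is lower triangular with diagonal exp(theta_raw[k]),
  // so the log Jacobian determinant is sum(theta_raw)
  increment_log_prob(sum(theta_raw));
  // ... prior on theta goes here
}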

- Bob

Eric Buhle

Jun 14, 2016, 4:29:29 PM
to Stan users mailing list
On Tuesday, June 14, 2016 at 11:48:33 AM UTC-7, Michael Betancourt wrote:
1) There’s a difference between expected background and actual background —
if if you know the expected background exactly you can get substantial variation
from measurement to measurement.  In other words, there’s a huge difference
between

R_meas ~ Poisson(C + S)

and

C_meas ~ Poisson(C)
S_meas ~ Poisson(S)
R_meas = C_meas + S_meas.

Conflicts such as R_meas - C > S_meas almost always come from assuming
the former when it’s really the latter that is appropriate.

I agree the former doesn't make sense, at least for my problem. I think the latter is more or less what I'm doing, but to make the analogy work I have to tweak a couple of things. First, I'm assuming no measurement noise in C, so C = C_meas. Second, R is not directly observable -- it's an intermediate state variable that gets transformed by subtracting C to obtain S, which is observable -- so there is no R_meas. So something like

R[t] ~ f(S[t-1])
S[t] = R[t] - C[t]
S_meas[t] ~ g(S[t]).
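
In Stan-like code, with lognormal purely as a placeholder for whatever f and g actually are:

data {
  int<lower=1> T;
  vector<lower=0>[T] C;       // catch, known exactly
  vector<lower=0>[T] S_meas;  // observed spawners
}

parameters {
  vector<lower=0>[T] S;       // true spawners, primitive and positive
  real<lower=0> sigma_proc;
  real<lower=0> sigma_obs;
}

transformed parameters {
  vector[T] R;
  R <- S + C;                 // constant shift, so no Jacobian term needed
}

model {
  for (t in 2:T)
    R[t] ~ lognormal(log(S[t-1]), sigma_proc);  // placeholder for f
  S_meas ~ lognormal(log(S), sigma_obs);        // placeholder for g
}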

-Eric

Eric Buhle

Jun 14, 2016, 4:34:02 PM
to Stan users mailing list
Hi Bob,

Thanks for the example. That makes sense, and I think it's essentially what I did in my reparameterized version (the one that doesn't mix at all). There, I would replace theta_lb with C and replace theta_raw with S. I actually declared S with a lower bound of zero instead of declaring it on the log scale and exponentiating, but same idea.

-Eric