rstan: how to discard or select certain chains from a stanfit object

Dalton Hance

Dec 27, 2016, 12:46:04 PM
to Stan users mailing list
I'm dealing with a model that is a little finicky, and I'm in the intermediate exploration and covariate selection phase. The model is fairly sensitive to initial values, so part of the process is finding good initial values each time I change the covariate structure. Sampling takes several hours for each run of the model. I'm using six chains and 1000 iterations, and I occasionally get a model run in which one or two chains consist entirely of divergent transitions while the rest of the chains look fine (this is due to initial values). I've been using the chains that sample fine to find new initial values and then refitting the model. However, this is potentially a lot of time sunk into a model that I will eventually abandon.

So that leads to my question: using rstan, is there a way to take a stanfit object, discard (or select) certain chains, and save the result to a new stanfit object? I'd like to be able to explore the model without worrying about those problematic chains. I would not do something like this for the final model used for inference, but it could be a time saver in this intermediate model building and criticism phase.

And of course, please give me your prudent admonishments if this is something I shouldn't be doing.
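For reference, the closest I've gotten is subsetting the draws as an array, which works for exploration but doesn't give back a stanfit (a rough sketch; `fit` is the stanfit and the chain indices are made up):

  draws <- as.array(fit)            # iterations x chains x parameters
  keep  <- setdiff(1:6, c(2, 6))    # made-up indices of the chains that sampled fine
  sub   <- draws[, keep, , drop = FALSE]

  # coda-style list of just the kept chains, for traceplots and summaries
  library(coda)
  sub_mcmc <- mcmc.list(lapply(keep, function(k) mcmc(draws[, k, ])))

As far as I can tell there is no supported way to rebuild a proper stanfit from a subset of chains.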

Ben Goodrich

Dec 27, 2016, 1:50:44 PM
to Stan users mailing list
On Tuesday, December 27, 2016 at 12:46:04 PM UTC-5, Dalton Hance wrote:
And of course, please give me your prudent admonishments if this is something I shouldn't be doing.

You shouldn't be doing this, in general. The chains without divergent transitions may have over-adapted to the bulk of the posterior distribution and never encountered the regions where the Markov chains diverge. And if we put a function into rstan that dropped chains, people would use it.

Ben

Stephen Martin

Dec 27, 2016, 2:21:35 PM
to Stan users mailing list
If you're getting divergent transitions, you should look into the areas of the posterior where they seem to be occurring, or into what makes some chains produce divergences and not others.

I only ever examine particular chains if I have a multimodal posterior where different chains are sampling different modes. This is not really an issue if the modes are merely label-swapped (e.g., in mixture models), but it is definitely a problem if the modes lead to different inferences (not like label-swapping).

If some chains are failing to sample as well as other chains, you should look into different parameterizations.

Dalton Hance

Dec 27, 2016, 3:10:23 PM
to Stan users mailing list
Thanks Stephen and Ben,

I suspected as much.

I'm not sure how I would go about alternative parameterizations given the complexity of the model. I'm dealing with a rather complicated posterior distribution and weak identifiability. The model wanders off into the weeds if I don't constrain the initial values.

If you've got some advice for alternative parameterizations I can describe the model in brief. I'm hesitant to paste too much of my code, but I can give you the essentials.

The conceptual model is as follows: imagine we are tracking cars between two toll booths, but at the second toll booth we have some lazy operators who allow some cars to pass without paying the toll. But the probability that the operator will not collect a toll depends on the hour at which the car passes (maybe it's raining or it's right after lunch). If a car passes the second toll booth without paying, we don't know anything about when it passes, only that it passed. For cars that do pay the toll at the second toll booth, we know exactly how long it took them to drive the stretch of highway, so we have an observed travel time and know exactly the hour in which they passed the toll booth. So here's what we know: the total number of cars that departed the first toll booth, the departure time of all cars from the first toll booth, the number of cars that paid a toll at the second toll booth, the exact arrival time of the cars that pay at the second toll booth, and the number of cars that did not pay a toll at the second toll booth and so we don't have an arrival time. We also have known covariates for every hour at both the first and second toll booth. For example, if it is snowing when cars pass the first toll booth, we expect their travel times to be longer. If it is foggy at the second toll booth, we expect the probability of collecting a toll to be lower for that hour.

A general description of the mathematical model is a zero-inflated gamma distribution, where the zero-inflation parameter depends on covariates indexed by the (gamma-distributed) travel time. When the observed datum is a zero, I'm "integrating" (really summing) over all possible covariate values by liberal use of gamma_lcdf and log_sum_exp.

So for non-zero observed values I have the following (simplified) model:

  for (n in 1:N_obs) {
    travel_time[n] ~ gamma(alpha, beta);
    target += log(p[t_obs[n]]); // t_obs[n] is the integer index obtained by truncating the observed travel time to the time step of my covariates
  }

For zeroes I have this model:

  for (n in 1:N_mis) {
    real lp[n_incr]; // n_incr is the total number of possible time step increments for which I have covariates

    lp[1] = log1m(p[1]) + gamma_lcdf(1 | alpha, beta);
    for (j in 2:(n_incr - 1)) {
      lp[j] = log1m(p[j]) +
              log_diff_exp(gamma_lcdf(j | alpha, beta),
                           gamma_lcdf(j - 1 | alpha, beta));
    }
    lp[n_incr] = log1m(p[n_incr]) +
                 log_diff_exp(log(1),
                              gamma_lcdf(n_incr - 1 | alpha, beta));

    target += log_sum_exp(lp);
  }

The actual model is quite a bit more complicated. (The above assumes only a single departure time.) There is both a gamma regression for travel times and a logistic regression for the zero-inflation parameter.

Those regressions are parameterized as follows:

  for (t in 1:n_departure_incr){ 
    //n_departure_incr implies multiple departure times (does NOT correspond to the model chunk in the message above)
    g_eps[t] = sigma_g * g_raw[t];
    gamma_mu[t] = exp(X[t,] * A + g_eps[t]); //using the log link - hierarchical non-centered
    alpha[t] = shape; 
    beta[t]  = shape/gamma_mu[t];
  }

for (t in 1:n_incr){
    p_eps[t]   = sigma_p * p_raw[t]; // hierarchical using the non-centered parameterization
    p[t] = inv_logit(Y[t,] * B + p_eps[t]);
  }

I have reasonable priors specified (using recommendations from the Stan wiki).

It works well with simulated data and my posterior predictive checks look OK. So I know the parameterization I have now works. But even with simulated data, I still have to specify initial values or I get these divergences. The divergences seem to stem from bad initial values.

I'll go ahead and refit models with divergences rather than try to shortcut it. Thank you for the polite admonishments.

Ben Goodrich

Dec 27, 2016, 3:56:54 PM
to Stan users mailing list
On Tuesday, December 27, 2016 at 3:10:23 PM UTC-5, Dalton Hance wrote:
The conceptual model is as follows: imagine we are tracking cars between two toll booths, but at the second toll booth we have some lazy operators who allow some cars to pass without paying the toll. But the probability that the operator will not collect a toll depends on the hour at which the car passes (maybe it's raining or it's right after lunch). If a car passes the second toll booth without paying, we don't know anything about when it passes, only that it passed. For cars that do pay the toll at the second toll booth, we know exactly how long it took them to drive the stretch of highway, so we have an observed travel time and know exactly the hour in which they passed the toll booth. So here's what we know: the total number of cars that departed the first toll booth, the departure time of all cars from the first toll booth, the number of cars that paid a toll at the second toll booth, the exact arrival time of the cars that pay at the second toll booth, and the number of cars that did not pay a toll at the second toll booth and so we don't have an arrival time. We also have known covariates for every hour at both the first and second toll booth. For example, if it is snowing when cars pass the first toll booth, we expect their travel times to be longer. If it is foggy at the second toll booth, we expect the probability of collecting a toll to be lower for that hour.

Sometimes things like this work better if you treat the unobserved travel times as missing data and declare them as parameters with lower=0 in the parameters block of the Stan program, rather than explicitly integrating over them (particularly if the log-CDF is not very numerically stable).
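Schematically, something like this (a sketch; N_mis and the gamma parameters as in your simplified model):

  parameters {
    vector<lower=0>[N_mis] travel_time_mis; // unobserved travel times as parameters
    ...
  }
  model {
    travel_time ~ gamma(alpha, beta);     // observed travel times
    travel_time_mis ~ gamma(alpha, beta); // missing travel times, same model
    ...
  }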

Ben

Dalton Hance

Dec 27, 2016, 4:46:32 PM
to Stan users mailing list
Thanks Ben. I tried that in an earlier version of the model, but I was having trouble due to truncating the continuous variable to a discrete index and thus breaking the gradient. But maybe it will work better if I use if statements to check whether the imputed travel time is within the interval I need.

Bob Carpenter

Dec 27, 2016, 5:06:58 PM
to stan-...@googlegroups.com

> On Dec 27, 2016, at 3:10 PM, Dalton Hance <dalton...@gmail.com> wrote:
>
> Thanks Stephen and Ben,
>
> I suspected as much.
>
> I'm not sure how I would go about alternative parameterizations given the complexity of the model. I'm dealing with a rather complicated posterior distribution and weak identifiability.

It's possible to write down models that are well behaved
in theory but won't compute in practice. Weak identifiability
is particularly problematic, especially when combined with varying
curvature (different matrix of second order derivatives in different
regions of the posterior).

> The model wanders off into the weeds if I don't constrain the initial values.

What makes you call this "the weeds"? If the sampler wanders
off, it's because it's chasing the high probability mass (not
the highest density areas, which confuses a lot of people). This
is what it's supposed to do. The only solution is to find a better
sampler or change the model.

> If you've got some advice for alternative parameterizations I can describe the model in brief. I'm hesitant to paste too much of my code, but I can give you the essentials.

It's very hard to diagnose problems in complicated models or
with fragments of models. It leaves us guessing on lots of
what's going on elsewhere.

> The conceptual model is as follows: imagine we are tracking cars between two toll booths, but at the second toll booth we have some lazy operators who allow some cars to pass without paying the toll. But the probability that the operator will not collect a toll depends on the hour at which the car passes (maybe it's raining or it's right after lunch). If a car passes the second toll booth without paying, we don't know anything about when it passes, only that it passed. For cars that do pay the toll at the second toll booth, we know exactly how long it took them to drive the stretch of highway so we have an observed travel time and know exactly the hour in which they passed the toll booth. So here's what we know: the total number of cars that departed the first toll booth, the departure time of all cars from the first toll booth, the number of cars that payed a toll at the second toll booth, the exact arrival time of the cars that pay at the second toll booth, and the number of cars that did not pay a toll at the second toll booth and so we don't have an arrival time. We also have known covariates for every hour at both the first and second toll booth. For example if it is snowing when cars pass the first toll booth, we expect their travel times to be longer. If it is foggy at the second tool booth, we expect the probability of collecting a toll to be lower for that hour.
>
> A general description of the math model is a zero-inflated gamma distribution, where the zero-inflation parameters depends on covariates conditional with the gamma value. I'm "integrating" (summing really) over all possible covariate values when the observed data is a zero by liberal use of gamma_lcdf and log_sum_exp values.
>
> So for non-zero observed values I have the following (simplified) model:
>
> for (n in 1:N_obs) {
> travel_time[n] ~ gamma(alpha, beta);
> target += log(p[t_obs[n]]); //t_obs[n] is the integer index taken by truncating the observed travel time to correspond to the time step of my covariates
> }

I didn't understand this, but I think it may be related to what
Ben was trying to suggest in terms of treating the unobserved
end times as missing data.

gamma distributions can be very long tailed if alpha and
beta are allowed to go below 1.

Do you know what the parameter values are that lead to the divergences?

> For zeroes I have this model:
>
> for (n in 1:N_mis) {
> real lp[n_incr]; // n_incr is the number of total possible time step increments for which I have covariates
>
> lp[1] = log1m(p[1]) + gamma_lcdf(1| alpha[t], beta[t]);
> for (j in 2:(n_incr-1) ){
> lp[j] = log1m(p[j]) +
> log_diff_exp(gamma_lcdf(j | alpha, beta),
> gamma_lcdf(j - 1| alpha, beta));
> }
> lp[n_incr] = log1m(p[n_incr]) +
> log_diff_exp(log(1),
> gamma_lcdf(n_incr-1 | alpha, beta));
>
> target += log_sum_exp(lp);
> }
>
> The actual model is quite bit more complicated. (The above assumes only a single departure time.) There is both a gamma regression for travel times and a logistic regression for the zero inflation parameter.

Why choose gamma for travel times? Would something
like a lognormal work?

> Those regressions are parameterized as follows:
>
> for (t in 1:n_departure_incr){
> //n_departure_incr implies multiple departure times (does NOT correspond to model chunk above message)
> g_eps[t] = sigma_g * g_raw[t];
> gamma_mu[t] = exp(X[t,] * A + g_eps[t]); //using the log link - hierarchical non-centered
> alpha[t] = shape;
> beta[t] = shape/gamma_mu[t];
> }
>

exp() is often a culprit in divergences. Unfortunately, we
don't have a gamma parameterization in terms of log parameters,
so even though you can calculate log_beta stably, there's
no way to plug it into the gamma distribution.

This would also be more efficient vectorizing on t.

I don't know where shape comes from, but I don't see why you'd
define a vector of alpha[t] all set to the same value. You can
just plug the scalar into the gamma distribution.
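
For example, with a lognormal you can stay on the log scale and drop the exp() entirely (a sketch; log_mu, sigma_tt, and the index d[n] are illustrative names, with sigma_tt a new scale parameter):

  for (t in 1:n_departure_incr) {
    log_mu[t] = X[t, ] * A + sigma_g * g_raw[t]; // location already on the log scale
  }
  // then, indexing each observation by its departure increment d[n]:
  travel_time[n] ~ lognormal(log_mu[d[n]], sigma_tt);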

> for (t in 1:n_incr){
> p_eps[t] = sigma_p * p_raw[t]; // hierarchical using the non-centered parameterization
> p[t] = inv_logit(Y[t,] * B + p_eps[t]);
> }
>
> I have reasonable priors specified (using recommendations from the Stan wiki).
>
> It works well with simulated data and my posterior predictive checks looks ok. So I know the parameterization I have now works. But even with simulated data, I still have to specify initial values or I get these divergences. The divergences seem to stem from bad initial values.

Are the divergences only during warmup? If so, I wouldn't
worry too much about them. We only care about divergences
that arise from the typical set (volume containing most of
the posterior probability mass, which may not even include
the posterior mean or mode).

> I'll go ahead and refit models with divergences rather than try to short cut it. Thank you for the polite admonishments.

But refitting has the same problem---that there are volumes of
probability mass you're not exploring.

- Bob

Ben Goodrich

Dec 27, 2016, 5:51:05 PM
to Stan users mailing list
On Tuesday, December 27, 2016 at 4:46:32 PM UTC-5, Dalton Hance wrote:
Thanks Ben. I tried that in an earlier version of the model, but I was having trouble due to truncating the continuous variable to a discrete and thus breaking the gradient. But maybe it will work better if I use if statements to check whether the imputed travel time is within the interval I need.

I am not seeing why you would need to discretize continuous variables or use if statements (both of which will make life difficult for NUTS). If you do

parameters {
  ...
  vector<lower=0>[N_missing] missing_times;
}

then you have a continuous estimate for each car whose travel time is not observed.

Dalton Hance

Dec 27, 2016, 6:20:54 PM
to Stan users mailing list
Hi Ben,

My interest is not really in the travel times, but in the discrete parameter that causes zero-inflation. This parameter varies with time. So using the toll booth analogy, if a car arrives at the second toll booth between hours 0 and 1, its travel time would be observed with probability described by the p[1] parameter. But if it arrives between hours 1 and 2, it would be observed with a probability described by the p[2] parameter, and so forth.

Because I don't observe travel times for cars that pass the second toll booth undetected, I need to work with the marginal distribution and integrate out the unobserved travel time. So if Y is my observed travel time and X is the true travel time, that marginal distribution is:

Pr(Y = 0) = (1 - p[1]) * Pr(X < 1) + (1 - p[2]) * Pr(1 < X < 2) + ...

It's simpler for observed travel times, because that is:
Pr(Y = x) = p[trunc(x)] * Pr(X = x)
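
As a quick numerical sanity check of that marginalization in R (with made-up parameter values), the undetected and detected probabilities should sum to 1:

  alpha <- 2; beta <- 1.5               # made-up gamma parameters
  n_incr <- 6
  p <- plogis(rnorm(n_incr))            # made-up hourly detection probabilities

  # interval probabilities Pr(j-1 < X < j); the last bin absorbs the upper tail
  bin <- diff(c(0, pgamma(1:(n_incr - 1), alpha, beta), 1))

  pr_zero <- sum((1 - p) * bin)         # Pr(Y = 0), summed over arrival bins
  pr_det  <- sum(p * bin)               # total probability of being detected
  pr_zero + pr_det                      # should equal 1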

Bob Carpenter

Dec 27, 2016, 6:40:22 PM
to stan-...@googlegroups.com
We usually don't recommend that kind of discretization because
it treats 1 minute and 59 minutes as more similar than 59 minutes
and 61 minutes by construction.

The computational problem arises because this kind of model
breaks continuous differentiability, which Hamiltonian Monte
Carlo assumes.

We intentionally avoided a function to round a continuous value to
an integer for just this reason. So you can't really use trunc(x)
as an index---you have to hack it with loops and conditionals (we may
eliminate conditionals based on parameters in Stan 3 as they typically
break differentiability).

You could explicitly marginalize out the discrete bin, but only
if there are finitely many time chunks.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Dalton Hance

Dec 27, 2016, 6:52:12 PM
to Stan users mailing list
Hi Bob,

I really appreciate your time. Apologies for the poor etiquette of a partial model. I don't think my original intent was to ask for help with reparameterization, but I guess I'm there now. You've given me a good clue about what to do on that front. I think it might be better to try this with a lognormal model for travel times rather than the gamma. Answers to your comments are below, out of courtesy.


> It's possible to write down models that are well behaved
> in theory but won't compute in practice. Weak identifiability
> is particularly problematic, especially when combined with varying
> curvature (different matrix of second order derivatives in different
> regions of the posterior).
I know I might be pushing the limits of what I can accomplish (limitations in my skillset more so than the software), but getting the model to work on simulated data that shares many properties of the real data encouraged me.
 

> What makes you call this "the weeds"? If the sampler wanders
> off, it's because it's chasing the high probability mass (not
> the highest density areas, which confuses a lot of people). This
> is what it's supposed to do. The only solution is to find a better
> sampler or change the model.

Just using colorful language. For some or all chains, I get divergent transitions for every iteration without picking smart initial values. I've seen the "find better initial values" advice elsewhere on this forum, and when I find those values the model will sample without too much trouble; it just takes time.

> It's very hard to diagnose problems in complicated models or
> with fragments of models. It leaves us guessing on lots of
> what's going on elsewhere.

Yes, I know that; apologies for the poor etiquette. Two reasons: 1) the model has some idiosyncrasies that make it hard to understand at a casual glance. I'm confident it works given that I have simulated data with those idiosyncrasies and I'm able to retrieve the simulated parameters. 2) we'd like to publish this (it's for an ecological application) and I don't want to let the cat too much out of the bag quite yet. The Stan users list will be the first line in the acknowledgements. :)

> I didn't understand this, but I think it may be related to what
> Ben was trying to suggest in terms of treating the unobserved
> end times as missing data.
>
> gamma distributions can be very long tailed if alpha and
> beta are allowed to go below 1.
>
> Do you know what the parameter values are that lead to the divergences?

I seem to get divergences when I don't pick good initial values for the gamma regression mean parameters, but I haven't tested that thoroughly.


> Why choose gamma for travel times? Would something
> like a lognormal work?

I don't have a strong reason for choosing gamma, other than that's where I started when I built this up from a simple simulation through ever more complicated cases to actual data. Yes, a lognormal would work, and it is something I was planning on exploring. I can certainly avoid the step of using exp() for the gamma mean if I go that route.

> exp() is often a culprit in divergences. Unfortunately, we
> don't have a gamma parameterization in terms of log parameters,
> so even though you can calculate log_beta stably, there's
> no way to plug it into the gamma distribution.
>
> This would also be more efficient vectorizing on t.
>
> I don't know where shape comes from, but I don't see why you'd
> define a vector of alpha[t] all set to the same value. You can
> just plug the scalar into the gamma distribution.

That is all good advice. I'm still in the intermediate phase of model tweaking.

I think based on your advice here, it would be very prudent of me to look into reparameterizing using a lognormal.

> Are the divergences only during warmup? If so, I wouldn't
> worry too much about them. We only care about divergences
> that arise from the typical set (volume containing most of
> the posterior probability mass, which may not even include
> the posterior mean or mode).

No, they're during sampling. The behavior I'm seeing is that if I'm aiming for 2000 post-warmup samples with 4 chains, I will sometimes get a message telling me that there were 2000 divergent transitions, but sometimes a message that there were 500. In the first case the traceplots are a set of wandering lines; in the second case, I see three nice-looking traceplots indicating good mixing, with one wandering line. So that one chain derailed, and I'm convinced it derailed due to bad initial conditions (probably because my gamma_mu term gets too small). If I refit specifying initial conditions, then I get no divergent transitions.


Dalton Hance

Dec 27, 2016, 7:18:35 PM
to Stan users mailing list
Hi Bob,

Again, thank you for the good advice. I agree with you that there is something goofy about discretizing this. But my logic in this application is that I have covariates at some minimum discrete timestep. For example, using the toll booth analogy, I would have hourly measurements of visibility. If those covariates are smoothly changing over time, then my discrete parameters should also be somewhat smooth. (One thing I also want to try is putting something like an AR(1) process in here.) There are also some practical reasons to discretize in my application; using the toll booth analogy, that would be something like a shift change resulting in a new toll booth operator.

I have given some thought to something like a Gaussian process describing a continuously changing probability of detection overlain on top of this travel time process. But that is a bridge too far with my skill set at the moment.

> We intentionally avoided a function to round a continuous value to
> an integer for just this reason. So you can't really use trunc(x)
> as an index---you have to hack it with loops and conditionals (we may
> eliminate conditionals based on parameters in Stan 3 as they typically
> break differentiability).

I understand your reasons, but it would be nice to more easily switch between integer and real, for example in the generated quantities block. I think I'm on solid ground truncating a real to an integer for my application.

> You could explicitly marginalize out the discrete bin, but only
> if there are finitely many time chunks.

I do have finitely many time chunks. And isn't marginalizing out the discrete bin what I'm doing? 

This:

Pr(Y = 0) = (1 - p[1]) * Pr(X < 1) + (1 - p[2]) * Pr(1 < X < 2) + ... + (1 - p[n_incr]) * (1 - Pr(X < n_incr))

Is represented in Stan as:

  for (n in 1:N_mis) {
    real lp[n_incr]; // n_incr is the total number of possible time step increments for which I have covariates

    lp[1] = log1m(p[1]) + gamma_lcdf(1 | alpha, beta);
    for (j in 2:(n_incr - 1)) {
      lp[j] = log1m(p[j]) +
              log_diff_exp(gamma_lcdf(j | alpha, beta),
                           gamma_lcdf(j - 1 | alpha, beta));
    }
    lp[n_incr] = log1m(p[n_incr]) +
                 log_diff_exp(log(1),
                              gamma_lcdf(n_incr - 1 | alpha, beta));

    target += log_sum_exp(lp);
  }

And for the observed data I have Pr(Y = x) = p[trunc(x)] * Pr(X = x) as:

   for (n in 1:N_obs) {
       travel_time[n] ~ gamma(alpha, beta);
       target += log(p[t_obs[n]]); //t_obs[n] is the integer index taken by truncating the observed travel time to correspond to the time step of my covariates
   }

So even though I'm truncating here to get an index, I'm truncating data. This shouldn't affect the gradient for determining the alpha and beta parameters (at least for the observed piece).

Bob Carpenter

Dec 27, 2016, 7:45:44 PM
to stan-...@googlegroups.com

> On Dec 27, 2016, at 7:18 PM, Dalton Hance <dalton...@gmail.com> wrote:
>
> Hi Bob,
>
> Again thank you for the good advice. I agree with you that there is something goofy about discretizing this. But my logic in this application is that I have covariates at some minimum discrete timestep.

That sounds like a different model.

> For example using the toll booth analogy, I would have hourly measurements of visibility. If those covariates are smoothly changing over time then my discrete parameters should also be somewhat smooth. (One thing I also want to try is putting something like an AR(1) process in here.) There is also some practical reasons to discretize in my application. Using the toll booth analogy that would be something like a shift change resulting in new toll booth operator.
>
> I have given some thought to something like a Guassian process describing a continuously changing probability of detection overlain on top of this travel time process. But that is a bridge too far with my skill set at the moment.
>
> We intentionally avoided a function to round a continuous value to
> an integer for just this reason. So you can't really use trunc(x)
> as an index---you have to hack it with loops and conditionals (we may
> eliminate conditionals based on parameters in Stan 3 as they typically
> break differentiability).
>
> I understand your reasons, but it would be nice to more easily switch between integer and real, for example in the generated quantities step.

Agreed, but we don't have a way to expose functions just for data
yet. As is, any function in Stan can be applied to parameters or
data in any argument.

> I think I'm on solid grounds for the truncating a real to an integer for my application.
>
> You could explicitly marginalize out the discrete bin, but only
> if there are finitely many time chunks.
>
> I do have finitely many time chunks. And isn't marginalizing out the discrete bin what I'm doing?
>
> This:
> Pr(Y = 0) = (1-p[1]) * Pr(X<1) + (1 - p[2]) *Pr(1<X<2) + ... + (1 - p[n_incr]) * (1 - Pr(X < n_incr))

Yes. But I don't get why it's (1 - p[1]) rather than p[1]. Same in the code
below where it's the same thing on the log scale.

- Bob

Dalton Hance

unread,
Dec 27, 2016, 8:40:29 PM12/27/16
to Stan users mailing list
Yes.  But I don't get why it's (1 - p[1]) rather than p[1].  Same in the code 
below where it's the same thing on the log scale. 

Think conditionally. We have a partially hidden variable, X, which is the travel time of the car between the first toll booth and the second toll booth. If the car arrives at the second toll booth in the first hour (travel time between 0 and 1 hour) it is detected with probability p[1]. If it arrives in the second hour (travel time between 1 and 2) it is detected with probability p[2], and so forth until the Nth hour. Any car that arrives at the last hour or later is detected with probability p[N]. Let's assign the detection state to a conditional variable Y which is Bernoulli conditional on X. That is Pr(Y = 1|X = x) = p[trunc(x) + 1] and Pr(Y = 0 | X = x) = 1 - p[trunc(x) + 1]. (Note I was sloppy earlier and should have included the + 1 after the truncation.) Finally, let's call our observed data Z with Z = X * Y.

The detected case is easy. That gives us Pr(Z = x) = Pr(Y = 1 | X = x) * Pr(X = x) = p[trunc(x) + 1] * Pr(X = x), where Pr(X =x) is given by a gamma (or lognormal) pdf.

For the non-detected case we have to sum over all the possible travel times to integrate out the unobserved travel time. Pr(Z = 0) =  Integration over the support of X of Pr(Y = 0 | X = x ) * Pr(X = x) = (1-p[1]) * Pr(X<1) + (1 - p[2]) *Pr(1<X<2) + ... + (1 - p[n_incr]) * (1 - Pr(X < n_incr - 1)), where Pr(X < 1) is the cdf now. 

Sorry if the description was overkill. You're usually way over my head, so you probably don't need me to lead you by the hand.
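One other check I find reassuring: because p is piecewise constant on hour bins, the integral above collapses to a sum of bin masses, and the detected and non-detected probabilities have to add to one. A quick Python check (exponential travel times, i.e. alpha = 1, with made-up p and beta), including a Monte Carlo draw from the Z = X * Y generative story:

```python
import math
import random

beta = 0.7
p = [0.3, 0.4, 0.5, 0.6, 0.2]
n = len(p)

F = lambda x: 1.0 - math.exp(-beta * x)  # exponential(beta) CDF

# bin masses Pr(j-1 < X < j) for j = 1..n-1, plus the open-ended tail
mass = [F(j) - F(j - 1) for j in range(1, n)] + [1.0 - F(n - 1)]

pr_detected = sum(pj * m for pj, m in zip(p, mass))
pr_not_detected = sum((1 - pj) * m for pj, m in zip(p, mass))

# Monte Carlo version of the generative story: draw X, detect with p for X's bin
random.seed(1)
draws = 200_000
detected = sum(random.random() < p[min(int(random.expovariate(beta)), n - 1)]
               for _ in range(draws))
mc_detected = detected / draws
```

The simulated detection frequency lands on the analytic Pr(Y = 1), and the two analytic pieces sum to one, which is exactly the property the marginalized Stan likelihood relies on.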

Bob Carpenter

unread,
Dec 29, 2016, 12:58:25 PM12/29/16
to stan-...@googlegroups.com

> On Dec 27, 2016, at 8:40 PM, Dalton Hance <dalton...@gmail.com> wrote:
>
> Yes. But I don't get why it's (1 - p[1]) rather than p[1]. Same in the code
> below where it's the same thing on the log scale.
>
> Think conditionally. We have a partially hidden variable, X, which is the travel time of the car between the first toll booth and the second toll booth. If the car arrives at the second toll booth in the first hour (travel time between 0 and 1 hour) it is detected with probability p[1]. If it arrives in the second hour (travel time between 1 and 2) it is detected with probability p[2], and so forth until the Nth hour. Any car that arrives at the last hour or later is detected with probability p[N]. Let's assign the detection state to a conditional variable Y which is Bernoulli conditional on X. That is Pr(Y = 1|X = x) = p[trunc(x) + 1] and Pr(Y = 0 | X = x) = 1 - p[trunc(x) + 1]. (Note I was sloppy earlier and should have included the + 1 after the truncation.) Finally, let's call our observed data Z with Z = X * Y.

Got it. Thanks.

> The detected case is easy. That gives us Pr(Z = x) = Pr(Y = 1 | X = x) * Pr(X = x) = p[trunc(x) + 1] * Pr(X = x), where Pr(X =x) is given by a gamma (or lognormal) pdf.
>
> For the non-detected case we have to sum over all the possible travel times to integrate out the unobserved travel time. Pr(Z = 0) = Integration over the support of X of Pr(Y = 0 | X = x ) * Pr(X = x) = (1-p[1]) * Pr(X<1) + (1 - p[2]) *Pr(1<X<2) + ... + (1 - p[n_incr]) * (1 - Pr(X < n_incr - 1)), where Pr(X < 1) is the cdf now.
>
> Sorry if the description was overkill. You're usually way over my head, so you probably don't need me to lead you by the hand.

Thanks. I very much appreciate the longer form. It's very hard for me
to jump into new models and notations. I can usually work them out given
time, but it's easier to just ask a few questions. So I appreciate the longer form!

- Bob
