> On Dec 27, 2016, at 3:10 PM, Dalton Hance <
dalton...@gmail.com> wrote:
>
> Thanks Stephen and Ben,
>
> I suspected as much.
>
> I'm not sure how I would go about alternative parameterizations given the complexity of the model. I'm dealing with a rather complicated posterior distribution and weak identifiability.
It's possible to write down models that are well behaved
in theory but won't compute in practice. Weak identifiability
is particularly problematic, especially when combined with varying
curvature (different matrices of second-order derivatives in different
regions of the posterior).
> The model wanders off into the weeds if I don't constrain the initial values.
What makes you call this "the weeds"? If the sampler wanders
off, it's because it's chasing the high probability mass (not
the highest density areas, which confuses a lot of people). This
is what it's supposed to do. The only solution is to find a better
sampler or change the model.
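To see why "the weeds" can be exactly where the mass is, here's a minimal numpy sketch (not from the model above) of the mode-vs-mass distinction: in high dimensions, draws from a standard normal concentrate in a thin shell far from the mode at the origin.

```python
import numpy as np

# In D dimensions, draws from N(0, I) concentrate near radius sqrt(D),
# nowhere near the mode at the origin. A sampler that "wanders off"
# toward that shell is doing its job.
rng = np.random.default_rng(0)
D = 1000
draws = rng.standard_normal((100, D))
radii = np.linalg.norm(draws, axis=1)
print(radii.mean())   # close to sqrt(1000), about 31.6
print(radii.min())    # no draw is anywhere near the mode at radius 0
```

The same intuition applies to any high-dimensional posterior: the highest-density point can carry essentially zero probability mass.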
> If you've got some advice for alternative parameterizations I can describe the model in brief. I'm hesitant to paste too much of my code, but I can give you the essentials.
It's very hard to diagnose problems in complicated models or
with fragments of models. It leaves us guessing on lots of
what's going on elsewhere.
> The conceptual model is as follows: imagine we are tracking cars between two toll booths, but at the second toll booth we have some lazy operators who allow some cars to pass without paying the toll. The probability that the operator will not collect a toll depends on the hour at which the car passes (maybe it's raining, or it's right after lunch). If a car passes the second toll booth without paying, we don't know anything about when it passed, only that it passed. For cars that do pay the toll at the second toll booth, we know exactly how long it took them to drive the stretch of highway, so we have an observed travel time and know exactly the hour in which they passed the toll booth.
>
> So here's what we know: the total number of cars that departed the first toll booth, the departure time of every car from the first toll booth, the number of cars that paid a toll at the second toll booth, the exact arrival time of the cars that paid at the second toll booth, and the number of cars that did not pay a toll at the second toll booth (for which we therefore have no arrival time).
>
> We also have known covariates for every hour at both the first and second toll booths. For example, if it is snowing when cars pass the first toll booth, we expect their travel times to be longer. If it is foggy at the second toll booth, we expect the probability of collecting a toll to be lower for that hour.
>
> A general description of the math model is a zero-inflated gamma distribution, where the zero-inflation parameter depends on covariates conditional on the gamma value. I'm "integrating" (summing, really) over all possible covariate values when an observed value is a zero, by liberal use of gamma_lcdf and log_sum_exp.
>
> So for non-zero observed values I have the following (simplified) model:
>
> for (n in 1:N_obs) {
> travel_time[n] ~ gamma(alpha, beta);
> target += log(p[t_obs[n]]); // t_obs[n] is the integer index obtained by truncating the observed travel time to the time step of my covariates
> }
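For concreteness, here's a scipy sketch of what each iteration of that loop adds to target, with hypothetical scalar values standing in for alpha, beta, and p (Stan's gamma uses a rate, so scipy's scale is 1/beta):

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

# Hypothetical stand-ins for the model's parameters.
alpha, beta = 2.0, 0.5          # Stan's gamma(alpha, beta) has rate beta
p = np.array([0.6, 0.7, 0.8])   # per-time-step probabilities
travel_time = np.array([1.3, 2.7])

# t_obs[n]: covariate time step from truncating the travel time
# (0-based here; Stan indexes from 1).
t_obs = travel_time.astype(int)

# Each observation contributes gamma_lpdf + log(p[t_obs[n]]) to target.
target = (gamma_dist.logpdf(travel_time, a=alpha, scale=1.0 / beta)
          + np.log(p[t_obs])).sum()
```

This is just the observed-data likelihood; the missing-data loop below handles the zeros.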
I didn't understand this, but I think it may be related to what
Ben was trying to suggest in terms of treating the unobserved
end times as missing data.
Gamma distributions can be very long-tailed if alpha and
beta are allowed to go below 1.
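A quick scipy check of that behavior (values here are illustrative, not from the model): with shape below 1, the gamma density is unbounded at zero while still putting substantial mass far out in the tail.

```python
from scipy.stats import gamma as gamma_dist

# Shape < 1: density blows up near zero...
near_zero_small_shape = gamma_dist.pdf(1e-6, a=0.5, scale=1.0)  # huge
near_zero_big_shape = gamma_dist.pdf(1e-6, a=2.0, scale=1.0)    # tiny

# ...while the tail still reaches far out relative to the mean of 0.5.
q99_small = gamma_dist.ppf(0.99, a=0.5, scale=1.0)
```

That combination of a spike at zero and a long tail is hard geometry for any sampler.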
Do you know what the parameter values are that lead to the divergences?
> For zeroes I have this model:
>
> for (n in 1:N_mis) {
> real lp[n_incr]; // n_incr is the number of total possible time step increments for which I have covariates
>
> lp[1] = log1m(p[1]) + gamma_lcdf(1 | alpha, beta);
> for (j in 2:(n_incr-1) ){
> lp[j] = log1m(p[j]) +
> log_diff_exp(gamma_lcdf(j | alpha, beta),
> gamma_lcdf(j - 1| alpha, beta));
> }
> lp[n_incr] = log1m(p[n_incr]) +
> log1m_exp(gamma_lcdf(n_incr - 1 | alpha, beta));
>
> target += log_sum_exp(lp);
> }
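As a sanity check on that marginalization, here's a Python mirror of the loop (scalar alpha and beta, plain log of differenced CDFs in place of log_diff_exp). When p is zero everywhere the intervals partition the whole support, so the marginal probability should come out to exactly 1:

```python
import numpy as np
from scipy.stats import gamma as gamma_dist
from scipy.special import logsumexp

def missing_lp(alpha, beta, p):
    """Marginal log probability of an unobserved arrival: sum over
    time-step intervals of (1 - p[j]) * P(travel time in interval j)."""
    n_incr = len(p)
    # log CDF at the interval boundaries 1, ..., n_incr - 1
    lcdf = gamma_dist.logcdf(np.arange(1, n_incr), a=alpha, scale=1.0 / beta)
    lp = np.empty(n_incr)
    lp[0] = np.log1p(-p[0]) + lcdf[0]
    for j in range(1, n_incr - 1):
        lp[j] = np.log1p(-p[j]) + np.log(np.exp(lcdf[j]) - np.exp(lcdf[j - 1]))
    lp[-1] = np.log1p(-p[-1]) + np.log1p(-np.exp(lcdf[-1]))  # upper tail
    return logsumexp(lp)
```

With constant p the (1 - p) factor pulls out, so missing_lp(alpha, beta, np.full(n, c)) should equal log(1 - c) for any alpha, beta; that's a cheap way to test the Stan version against simulation too.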
>
> The actual model is quite a bit more complicated. (The above assumes only a single departure time.) There is both a gamma regression for travel times and a logistic regression for the zero-inflation parameter.
Why choose gamma for travel times? Would something
like a lognormal work?
> Those regressions are parameterized as follows:
>
> for (t in 1:n_departure_incr){
> //n_departure_incr implies multiple departure times (does NOT correspond to model chunk above message)
> g_eps[t] = sigma_g * g_raw[t];
> gamma_mu[t] = exp(X[t,] * A + g_eps[t]); //using the log link - hierarchical non-centered
> alpha[t] = shape;
> beta[t] = shape/gamma_mu[t];
> }
>
exp() is often the culprit in divergences. Unfortunately, we
don't have a gamma parameterization in terms of log parameters,
so even though you can calculate log_beta stably, there's
no way to plug it into the gamma distribution.
This would also be more efficient if vectorized over t.
I don't know where shape comes from, but I don't see why you'd
define a vector of alpha[t] entries all set to the same value. You can
just plug the scalar into the gamma distribution.
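Two quick numeric checks on that parameterization (hypothetical values for shape and the linear predictor): alpha = shape, beta = shape / mu does give a gamma with mean mu, and the exp() is exactly where things can blow up.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

shape = 3.0
eta = 1.2                      # hypothetical stand-in for X[t,] * A + g_eps[t]
mu = np.exp(eta)
alpha, beta = shape, shape / mu

# mean of gamma(alpha, rate beta) = alpha / beta = mu, as intended
mean = gamma_dist.mean(a=alpha, scale=1.0 / beta)

# The fragile step: a large linear predictor overflows exp(),
# even though log(beta) = log(shape) - eta would be perfectly stable.
with np.errstate(over="ignore"):
    overflowed = np.exp(800.0)  # inf in double precision
```

So tight priors (or standardized covariates) that keep the linear predictor moderate are doing real numerical work here, not just regularization.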
> for (t in 1:n_incr){
> p_eps[t] = sigma_p * p_raw[t]; // hierarchical using the non-centered parameterization
> p[t] = inv_logit(Y[t,] * B + p_eps[t]);
> }
>
> I have reasonable priors specified (using recommendations from the Stan wiki).
>
> It works well with simulated data, and my posterior predictive checks look OK, so I know the parameterization I have now works. But even with simulated data, I still have to specify initial values or I get these divergences. The divergences seem to stem from bad initial values.
Are the divergences only during warmup? If so, I wouldn't
worry too much about them. We only care about divergences
that arise from the typical set (volume containing most of
the posterior probability mass, which may not even include
the posterior mean or mode).
> I'll go ahead and refit models with divergences rather than try to short cut it. Thank you for the polite admonishments.
But refitting has the same problem---that there are volumes of
probability mass you're not exploring.
- Bob