Issue fitting ordered logistic model: "error evaluating the log probability at the initial value."

Jesse Wolfhagen

unread,

Jul 29, 2016, 10:32:37 AM7/29/16

to Stan users mailing list

Hello, I am using RStan to try to fit an ordered logistic model to data where the specific stage of each data point is unknown (the probability of being in each stage is known). The basic model works when data are given specific stages, but I am running into issues trying to base it off of probabilities of being in each stage instead. I am not sure how to address the error here, it's not clear to me where the invalid values are. I have tried running it with initial values for AgeClass (the result of the categorical_rng), but I get the same error. Am I not specifying the AgeClass variable correctly because it is being generated within the model?

The model (and some simulated data) are here:

Model:

data{

int<lower=1> N; //number of mandibles

matrix[N,9] ageprobs; //probabilities of being in each wear stage

}

parameters{

ordered[8] agebreaks;

}

model{

vector[N] phi; //dummy for coefficients, currently no model

int AgeClass[N];

agebreaks ~ normal( 0 , 10 );

for ( i in 1:N ) {

phi[i] = 0;

}

for ( i in 1:N ){

AgeClass[i] ~ ordered_logistic( phi[i] , agebreaks );

}

generated quantities{

vector[N] phi;

int AgeClass[N];

for ( i in 1:N ) {

phi[i] = 0;

}

for ( i in 1:N ){

AgeClass[i] = categorical_rng(softmax(to_vector(ageprobs[i,])));

}

Initialization in R:

lowerstage <- rep(c(1,2,3,4,5,6,7,8,9),each=5)

upperstage <- c(1:5,2:6,3:7,4:8,5:9,c(6:9,9),c(7:9,9,9),c(8:9,9,9,9),rep(9,5))

aoristic.jaws <- function(jaw,limit)

{

vals <- rep(0,limit)

vals[jaw[1]:jaw[2]] <- 1/(1+jaw[2]-jaw[1])

vals

}

mandible.sim <- t(apply(cbind(lowerstage,upperstage),1,aoristic.jaws,limit=9))

mandible.dat <- list(N=nrow(mandible.sim),

ageprobs = mandible.sim)

mandibleinit <- function() {list(agebreaks = sort(runif(8,-3,3)))}

fit <- stan(file = 'mandiblewear.stan', data = mandible.dat, init = mandibleinit, iter = 1000, chains = 1)

I get the following diagnostic error and then a fatal error that stops/prevents sampling:

DIAGNOSTIC(S) FROM PARSER:

Warning (non-fatal):

Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.

If so, you need to call increment_log_prob() with the log absolute determinant of the Jacobian of the transform.

Left-hand-side of sampling statement:

AgeClass[i] ~ ordered_logistic(...)

Error message:

SAMPLING FOR MODEL 'mandiblewear' NOW (CHAIN 1).

[1] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"

[2] "Exception thrown at line 19: stan::math::ordered_logistic: Random variable is 0, but must be between (1, 9)"

[3] "If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,"

[4] "but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."

[5] "Rejecting initial value:"

[6] " Error evaluating the log probability at the initial value."

[7] "Error : "

[1] "error occurred during calling the sampler; sampling not done"

sessionInfo()

R version 3.3.1 (2016-06-21)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows >= 8 x64 (build 9200)

locale:

[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252

[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:

[1] stats graphics grDevices utils datasets methods base

other attached packages:

[1] rstan_2.10.1 StanHeaders_2.10.0-2 ggplot2_2.1.0

loaded via a namespace (and not attached):

[1] colorspace_1.2-6 scales_0.4.0 plyr_1.8.4 rsconnect_0.4.3 parallel_3.3.1 tools_3.3.1 inline_0.3.14 gtable_0.2.0

[9] gridExtra_2.2.1 Rcpp_0.12.5 codetools_0.2-14 grid_3.3.1 stats4_3.3.1 munsell_0.4.3

Bob Carpenter

unread,

Jul 29, 2016, 3:08:32 PM7/29/16

to stan-...@googlegroups.com

The problem is that you are using AgeClass before defining
it:

> model{
> vector[N] phi; //dummy for coefficients, currently no model
> int AgeClass[N];
> agebreaks ~ normal( 0 , 10 );
> for ( i in 1:N ) {
> phi[i] = 0;
> }
> for ( i in 1:N ){
> AgeClass[i] ~ ordered_logistic( phi[i] , agebreaks );
> }

The sampling statement is just shorthand for

target += ordered_logistic(AgeClass[i] | phi[i], agebreaks);

so AgeClass[i] needs to be defined before it is used. I'm
not sure what you're trying to do here, so I don't know how
to suggest you fix it. What you did in the generated quantities
block is how to do random generation from phi and agebreaks.

You also don't want to define a constant vector in the
model---it's more efficient in transformed data, and you can
assign it to rep_vector(0, N).

- Bob

P.S. Queries like this are the main reason we're considering
just eliminating the ~ statements altogether so they're not
conflated with random number generators.

Jesse Wolfhagen

unread,

Jul 31, 2016, 11:56:24 AM7/31/16

to Stan users mailing list

Hello,

Thank you for responding so quickly!

I understand the issue about defining AgeClass[i], but I guess I'm stumped as to how to address it directly. I am trying to fit an ordered logistic regression to a series of mandibles that can be assigned to ranked age stages (1-9). This works fine when all of the mandibles are assigned to a specific stage, but in reality several are broken and so cannot be assigned to one stage exclusively (for example, a mandible could be in either stage 1 or 2, thus given a probability of 0.5 for each stage and 0 for the others). I was trying to find a way to incorporate that uncertainty into the process of estimating the intercepts (which currently would give an idea of uncertainty due to sample size), rather than having to assign stages and run the model, then repeat to get an idea of how much of it changes based on the individual assignments.

Ideally, I'd like to make AgeClass a parameter (as it gets estimated via the categorical distribution call), but I cannot make an int parameter. I've tried using it as a local variable within the model block, which is what creates the error about it not being specified, even when I include initial values. Placing it in the data or transformed data blocks requires initial values, which fixes those values for the model rather than allowing the observed age state to vary.

Maybe I'm still just confusing the issue of the sampling statement vs. "values being drawn FROM a distribution" (the random number generators): there doesn't appear to be a way to draw from a distribution before the modeling step (that is, all transformations are deterministic), is that correct?

Is this something that requires a Gibbs sampler and so can't be done in Stan?

Here is my updated model (still has the same issue/error as previously, but this time SHOULD be less ... confused, I hope):

data{

int<lower=1> N; //# of mandibles

simplex[9] ageprobs[N]; //probability of a mandible being in each age stage

}

transformed data{

vector[N] phi;

phi = rep_vector(0, N);

}

parameters{

ordered[8] agebreaks;

}

model{

int AgeClass[N];

agebreaks ~ normal( 0 , 10 );

for ( i in 1:N )

AgeClass[i] ~ categorical(ageprobs[i]);

for ( i in 1:N )

AgeClass[i] ~ ordered_logistic( phi[i] , agebreaks );

}

Bob Carpenter

unread,

Jul 31, 2016, 6:17:59 PM7/31/16

to stan-...@googlegroups.com

> On Jul 31, 2016, at 11:56 AM, Jesse Wolfhagen <jl.wol...@gmail.com> wrote:
>
> Hello,
>
> Thank you for responding so quickly!
>
> I understand the issue about defining AgeClass[i], but I guess I'm stumped as to how to address it directly. I am trying to fit an ordered logistic regression to a series of mandibles that can be assigned to ranked age stages (1-9). This works fine when all of the mandibles are assigned to a specific stage, but in reality several are broken and so cannot be assigned to one stage exclusively (for example, a mandible could be in either stage 1 or 2, thus given a probability of 0.5 for each stage and 0 for the others). I was trying to find a way to incorporate that uncertainty into the process of estimating the intercepts (which currently would give an idea of uncertainty due to sample size), rather than having to assign stages and run the model, then repeat to get an idea of how much of it changes based on the individual assignments.

A typical way to do that would be to average over both outcomes with
the probability as weight. So if you have foo_lpmf(1 | theta) and foo_lpmf(2 | theta),
and they're weighted 50% each, then you want

target += 0.5 * foo_lpmf(1 | theta) + 0.5 * foo_lpmf(2 | theta);

> Ideally, I'd like to make AgeClass a parameter (as it gets estimated via the categorical distribution call), but I cannot make an int parameter.

It's better to do the above for sampling efficiency. It just
marginalizes out what would otherwise be the integer parameter.

> I've tried using it as a local variable within the model block, which is what creates the error about it not being specified, even when I include initial values. Placing it in the data or transformed data blocks requires initial values, which fixes those values for the model rather than allowing the observed age state to vary.
>
> Maybe I'm still just confusing the issue of the sampling statement vs. "values being drawn FROM a distribution" (the random number generators): there doesn't appear to be a way to draw from a distribution before the modeling step (that is, all transformations are deterministic), is that correct?
>
> Is this something that requires a Gibbs sampler and so can't be done in Stan?

Gibbs is only one approach to discrete sampling, but the answer
is "no", you can just marginalize it out like any other variable.

> Here is my updated model (still has the same issue/error as previously, but this time SHOULD be less ... confused, I hope):
>
> data{
> int<lower=1> N; //# of mandibles
> simplex[9] ageprobs[N]; //probability of a mandible being in each age stage
> }
> transformed data{
> vector[N] phi;
> phi = rep_vector(0, N);
> }
> parameters{
> ordered[8] agebreaks;
> }
> model{
> int AgeClass[N];
> agebreaks ~ normal( 0 , 10 );
> for ( i in 1:N )
> AgeClass[i] ~ categorical(ageprobs[i]);
> for ( i in 1:N )
> AgeClass[i] ~ ordered_logistic( phi[i] , agebreaks );
> }

You almost never want the same variable, e.g., AgeClass[i] on
the left-hand side of ~ more than once. At least not for
a typical generative Bayesian model.

I'm not sure what you're trying to infer here. The cutpoints
for agebreaks aren't doing anything for you without a real predictor
for phi. It essentially just makes the predictors uniform and
the agebreaks variable won't be identified.

- Bob

Reply all

Reply to author

Forward