Undefined variables in generated quantities, categorical logit discrete choice model

Kyle

Jan 13, 2017, 12:38:28 PM
to Stan users mailing list

Hi all,

I'm currently estimating discrete choice models using Stan via RStan 2.14 and R 3.3.2, and I'm having an issue with two generated quantities that I'm trying to use to calculate a Bayesian p-value for the model. I receive the error "The following variables have undefined values:  chis_obs,The following variables have undefined values:  chis_sim"

I've pasted the model below. Since the two quantities having problems are sums of the chi-square values computed from expected and observed data, I'm thinking the problem lies in calculating the "expected[a]" vector, since this is used in both sums. From reading around, my best guess is that it is related to numerical overflow or underflow in the softmax function.

I should note that in a slightly altered version of this model, a hierarchical version where the beta[J][K] (for J individuals in a population) were ~normal(mu[K], sigma[K]), this calculation worked. I did not have the log_lik computation in that version, but figured that wouldn't affect the generated quantities in question. Is that a poor assumption?

Any comments on improving efficiency or other practical suggestions are certainly welcome. Thank you!

-Kyle

data{
    int C;  // the number of choices per set
    int K;  // the number of slope parameters per individual
    int N;  // the number of observations
    matrix[C, K] x[N];  // variables in design matrix format
    int y[N]; // index of which alternative was selected
    vector[C] obs; //variable used in test-stat calculation
    vector[C] pos[C];
  }
  
  parameters{
    vector[K] beta;  // rsf coefficients
  }
  
  model{
    
    for(l in 1:K){
      beta[K] ~ normal(0, 10);  // prior distribution for slope parameter mean
    }
  
    for(i in 1:N){
      y[i] ~ categorical_logit(softmax(x[i] * beta));
    }
  }
  
  generated quantities{
    simplex[C] expected[N]; //the probabilities of use from dc_model
    vector[N] chis_obs_i;   //chi-square value for each choice set using observed data
    vector[N] chis_sim_i;  // chi-square value from simulated data
    real chis_obs;  //sum of chi square values across all choice sets
    real chis_sim;  //sum of chi square values across all choice sets
    vector[C] rch[N];  //simulated random choice of used alt. within obs. data sets
    vector[N] log_lik; // log likelihood
    int rcat[N];
  
    for(a in 1:N){
      expected[a] = softmax(x[a] * beta);
  
      rcat[a] = categorical_rng(expected[a]);
      rch[a] = pos[rcat[a]];
  
      chis_obs_i[a] = sum(((obs - expected[a]) .* (obs - expected[a])) ./
                         expected[a]);
      chis_sim_i[a] = sum(((rch[a] - expected[a]) .* (rch[a] - expected[a])) ./
                         expected[a]);
    
      log_lik[a] = categorical_logit_lpmf(y[a]| x[a] * beta);
    }
    chis_obs = sum(chis_obs_i);
    chis_sim = sum(chis_sim_i);
  
  }

Ben Goodrich

Jan 13, 2017, 12:47:53 PM
to Stan users mailing list
On Friday, January 13, 2017 at 12:38:28 PM UTC-5, Kyle wrote:
> I'm currently estimating discrete choice models using Stan via RStan 2.14 and R 3.3.2., and I'm having an issue with two generated quantities I'm trying to use to calculate a Bayesian p-value for the model. I receive an error that "The following variables have undefined values:  chis_obs,The following variables have undefined values:  chis_sim"

I would replace chis_obs and chis_sim with negative numbers whenever is_nan() or is_infinite() is true for them, and then figure out from R what went wrong in those cases.

Ben
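
A minimal sketch of the check Ben describes, written as a user-defined function. The helper name flag_undefined and the sentinel value -1 are assumptions for illustration and do not come from the thread. Stan's built-in tests are is_nan() and is_inf(); the latter is what "is_infinite()" refers to above. A chi-square sum cannot be negative, so a negative sentinel is unambiguous.

```stan
functions {
  // Hypothetical helper: returns the sentinel -1 when x is NaN or infinite,
  // and returns x unchanged otherwise.
  real flag_undefined(real x) {
    if (is_nan(x) || is_inf(x)) {
      return -1.0;
    }
    return x;
  }
}
// In generated quantities, the two final assignments would then read:
//   chis_obs = flag_undefined(sum(chis_obs_i));
//   chis_sim = flag_undefined(sum(chis_sim_i));
```

Draws equal to -1 can then be picked out of the extracted samples in R and matched to the iterations where the calculation failed.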

Kyle

Jan 13, 2017, 1:01:37 PM
to Stan users mailing list
Great, thanks Ben, I will try this and reply back.

-Kyle

Bob Carpenter

Jan 13, 2017, 1:36:47 PM
to stan-...@googlegroups.com
You can also print out the arguments in those cases and
the result of softmax within the generated quantities, but
I'd do that within the checks Ben suggested or you'll get
too much output to digest.

- Bob
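
A sketch of what printing inside the check might look like, assuming it is placed in the generated quantities loop of the model posted above, right after chis_obs_i[a] is computed. The message layout is an arbitrary choice; print() accepts vector-valued expressions as well as scalars.

```stan
// Fragment for the body of the loop `for (a in 1:N)` in generated quantities.
// Only prints for choice sets whose chi-square value is NaN or infinite,
// which keeps the output small enough to read.
if (is_nan(chis_obs_i[a]) || is_inf(chis_obs_i[a])) {
  print("choice set ", a,
        ": linear predictor = ", x[a] * beta,
        ", expected = ", expected[a]);
}
```

The same guard can be repeated for chis_sim_i[a] if the simulated statistic needs to be traced separately.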

Kyle

Jan 13, 2017, 5:17:58 PM
to Stan users mailing list
Thanks for your suggestion as well, Bob. I've now run the model both ways, and it appears that for most of the observations, the expected[a] simplex is (1.0e+00,0.0e+00,0.0e+00,0.0e+00,0.0e+00,0.0e+00), or contains values that are practically zero, such as 2.12e-243. The chis_sim and chis_obs are thus always NaNs, because they are sums of quantities divided by zero. Any immediate ideas? I was only able to investigate briefly and will continue checking more thoroughly later; I just thought I'd post this now.

Thanks,
-Kyle
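
To see why a saturated simplex leads to NaN, here is a small standalone sketch. It assumes the observed vector is a 0/1 indicator of the chosen alternative, which the thread does not state; the names p and o and the size 3 are placeholders. For every alternative with probability exactly 0 that was not chosen, the chi-square term is (0 - 0)^2 / 0 = 0/0, which is NaN under IEEE floating-point arithmetic, and a single NaN term makes the whole sum NaN.

```stan
// Standalone check with no data and no parameters
// (sampling from it would require the fixed_param algorithm).
transformed data {
  vector[3] p;  // saturated probabilities, like the reported expected[a]
  vector[3] o;  // hypothetical 0/1 indicator of the chosen alternative
  p[1] = 1;
  p[2] = 0;
  p[3] = 0;
  o[1] = 1;
  o[2] = 0;
  o[3] = 0;
  // Terms: (1 - 1)^2 / 1 = 0 for the first element, then 0/0 for the other two.
  print("chi-square = ", sum(((o - p) .* (o - p)) ./ p));
}
parameters {
}
model {
}
```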

Bob Carpenter

Jan 14, 2017, 8:34:17 PM
to stan-...@googlegroups.com
I'd check your data and start from a simpler model and build up.

- Bob

Kyle

Jan 15, 2017, 4:35:23 PM
to Stan users mailing list
Well, I found the culprit, after looking directly at it many, many times. The model contains a typo: I set a prior on the Kth parameter in the vector, instead of on parameters 1:K. Correcting the index beta[K] to beta[l] allows the model to run successfully and much faster. Sigh. To try to learn something beyond checking my model religiously: I'm guessing that improper uniform(-inf, inf) priors were implicitly placed on the remaining beta parameters, and the resulting possibility of extreme values meant the softmax function returned ~(1,0,0,0,0,0) for most observations, because of the exponentiation.

I appreciate the help, and thank you very much for this software and community.

-Kyle
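
For reference, a sketch of how the model block could look after the fix, reusing the data and parameters blocks from the original post. Two changes here go beyond what the thread discusses and are editorial assumptions: the prior is vectorized, which removes the loop index that was mistyped, and x[i] * beta is passed directly to categorical_logit, which takes unnormalized log-odds and applies softmax itself (this matches the log_lik line in the original post).

```stan
model {
  // Vectorized prior: normal(0, 10) is applied to every element of beta,
  // so there is no index to get wrong.
  beta ~ normal(0, 10);

  for (i in 1:N) {
    // categorical_logit expects the linear predictor, not probabilities.
    y[i] ~ categorical_logit(x[i] * beta);
  }
}
```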

Bob Carpenter

Jan 15, 2017, 5:43:10 PM
to stan-...@googlegroups.com
In retrospect, these problems are always obvious.

The thing to do is work backward from where you know
the problem arises to the inputs to that function and
on back. That's where print() and conditionals can
be useful. It's also why we try to fail early when
possible---fewer steps to trace back through.

- Bob