RStanARM: posterior_predictive probabilities for binomial GLMM

787 views
Skip to first unread message

Michael Weylandt

unread,
Feb 15, 2016, 11:24:10 PM2/15/16
to stan-...@googlegroups.com
I've fit binomial GLMM with RStanARM. When I use posterior_predict, I
get back counts (instead of probabilities).

My each 'row' of my data has different numbers of observations, so I'm
not sure what N is being used for these predictions.

Is it possible to get back the "raw" posterior predictive p, rather than p*N?

## Minimal Example
library(rstanarm)

D <- data.frame(S=rep(c("M", "F"), each=3),
X=rep(1:3, each=2),
K=c(10, 20, 30, 40, 50, 60),
N=c(120, 100, 80, 60, 120, 60))

M <- stan_glmer(cbind(K, N-K) ~ X + (1|S), data=D, family=binomial)

# What counts are these?
posterior_predict(M, newdata=data.frame(X=4), re.form=~0)

Thanks,
Michael

Jonah Gabry

unread,
Feb 16, 2016, 11:54:28 AM2/16/16
to Stan users mailing list
Hi Michael, 

Can you update to the just-released update of rstanarm on CRAN (version 2.9.0-3)? We added a Note section to the documentation for posterior_predict that explains how N is handled for binomial models and changed some things internally related to this.

If you don't specify newdata when calling posterior_predict the Ns are the Ns from the original data. If you do specify newdata then you should be required to specify your own Ns and get an error otherwise (the error wasn't being thrown in the first version of rstanarm). In your case, since the LHS of your formula is cbind(K, N-K) you could include in newdata a variable K=0 and a variable N that has the number of trials you want to use. If there are multiple observations in newdata N can be different for each of them. The predicted counts can then easily be divided by the number of trials if you want probabilities.

Relatedly, we don't recommend trying to directly use the posterior distribution of the inverse-logit transformed linear predictor (i.e., the predicted probabilities), but rather to always sample from the posterior predictive distribution of the outcome. If you want you can then get probabilities by division, as mentioned above. 

Also, in case it's helpful, one of the examples in the How to Use the rstanarm Package vignette has been updated in the new version of the package and now shows how to correctly specify the number of trials to use for prediction after binomial models.

Hope that helps, 

Jonah

Tom Wallis

unread,
Mar 16, 2016, 7:04:47 PM3/16/16
to Stan users mailing list
Hi Jonah,

I'm bumping this thread because I have a related question. 

I have a *bernoulli* GLMM for which I would like to plot predicted *probabilities* (similar to e.g. the predict method for a normal R glm, type = "response"). Is the only way to approximate these by switching the model formula to *binomial* (with one trial per row in my original data) and then using posterior_predict to approximate the probabilities?


On Tuesday, February 16, 2016 at 5:54:28 PM UTC+1, Jonah Gabry wrote:
Hi Michael, 

Can you update to the just-released update of rstanarm on CRAN (version 2.9.0-3)? We added a Note section to the documentation for posterior_predict that explains how N is handled for binomial models and changed some things internally related to this.

If you don't specify newdata when calling posterior_predict the Ns are the Ns from the original data. If you do specify newdata then you should be required to specify your own Ns and get an error otherwise (the error wasn't being thrown in the first version of rstanarm). In your case, since the LHS of your formula is cbind(K, N-K) you could include in newdata a variable K=0 and a variable N that has the number of trials you want to use. If there are multiple observations in newdata N can be different for each of them. The predicted counts can then easily be divided by the number of trials if you want probabilities.

Relatedly, we don't recommend trying to directly use the posterior distribution of the inverse-logit transformed linear predictor (i.e., the predicted probabilities), but rather to always sample from the posterior predictive distribution of the outcome. If you want you can then get probabilities by division, as mentioned above. 

Could you expand on why directly using the posterior of the inv-logit is not recommended?

Jonah Gabry

unread,
Mar 16, 2016, 8:46:34 PM3/16/16
to Stan users mailing list

On Wednesday, March 16, 2016 at 4:04:47 PM UTC-7, Tom Wallis wrote:
Hi Jonah,

I'm bumping this thread because I have a related question. 

I have a *bernoulli* GLMM for which I would like to plot predicted *probabilities* (similar to e.g. the predict method for a normal R glm, type = "response"). Is the only way to approximate these by switching the model formula to *binomial* (with one trial per row in my original data) and then using posterior_predict to approximate the probabilities?


For a bernoulli model you can use a function like this, making use of some rstanarm internal functions,

predicted_prob <- function(fit) {
  dat <- rstanarm:::pp_data(fit)
  eta <- rstanarm:::pp_eta(fit, dat)$eta
  plogis(eta) # inverse-logit 
}

We debated adding a user-facing function like this, but ...


Could you expand on why directly using the posterior of the inv-logit is not recommended?
 

It's not that it's particularly discouraged, we just want to encourage people to work with predicted outcomes, i.e. the full posterior predictive distribution, rather than just (functions of) the linear predictor.

Maybe we should reconsider our decision and provide a convenience function for this in the package.

Jonah 

Tom Wallis

unread,
Mar 17, 2016, 4:55:00 AM3/17/16
to Stan users mailing list
Ok great, thanks for the reply Jonah.

I think a user-facing function for this would be really useful (similar to "predict(type = response))". The reason my data are bernoulli rather than binomial is that I've got two random effects that cross over. What I'd like to do is plot the predicted probabilities marginalising over the levels of the random effects, against aggregated data (also marginalising over the levels of the random effects). Requiring this to be done by simulating N binomial draws from the model, then averaging those draws, seems unnecessary. How many draws should be used? If one is interested in the plogis, why not allow the uncertainty on this to be visualised directly, rather than adding another source of uncertainty (the number of binomial draws used to approximate it)?

Tom Wallis

unread,
Mar 17, 2016, 9:42:37 AM3/17/16
to Stan users mailing list
In the example code you provided, is there a way to pass new data (i.e. a data frame to make predictions over) to those internal methods? 

My attempt 

predicted_prob <- function(fit, newdata = NULL) {
  if (is.null(newdata)) {
    dat <- rstanarm:::pp_data(fit)
    eta <- rstanarm:::pp_eta(fit, dat)$eta
    plogis(eta) # inverse-logit    
  } else {
    eta <- rstanarm:::pp_eta(fit, newdata)$eta
    plogis(eta) # inverse-logit   
  }
}

gives a bug in pp_eta, presumably because the dataframe I'm passing doesn't correspond to some internal dataframe expected.

Jonah Sol Gabry

unread,
Mar 17, 2016, 2:11:13 PM3/17/16
to stan-...@googlegroups.com
On Thu, Mar 17, 2016 at 6:42 AM, Tom Wallis <tsaw...@gmail.com> wrote:
In the example code you provided, is there a way to pass new data (i.e. a data frame to make predictions over) to those internal methods? 

My attempt 

predicted_prob <- function(fit, newdata = NULL) {
  if (is.null(newdata)) {
    dat <- rstanarm:::pp_data(fit)
    eta <- rstanarm:::pp_eta(fit, dat)$eta
    plogis(eta) # inverse-logit    
  } else {
    eta <- rstanarm:::pp_eta(fit, newdata)$eta
    plogis(eta) # inverse-logit   
  }
}

gives a bug in pp_eta, presumably because the dataframe I'm passing doesn't correspond to some internal dataframe expected.

I think you need to pass newdata to pp_data instead of pp_eta. Does that work for you?

Tom Wallis

unread,
Mar 17, 2016, 3:18:08 PM3/17/16
to stan-...@googlegroups.com

No, that also resulted in an error (fails on is.mer).

--
You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/GS8_8djvke4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stan-users+...@googlegroups.com.
To post to this group, send email to stan-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jonah Gabry

unread,
Mar 17, 2016, 6:02:40 PM3/17/16
to Stan users mailing list
Hey Tom, 

I've moved this over to a GitHub issue (https://github.com/stan-dev/rstanarm/issues/85) so we can discuss whether to add this to the package. In the meantime, I added a branch to the rstanarm repository with a function for getting the posterior distribution of the linear predictor (optionally with new data). I put some instructions for installing in the comments at the new issue.

Jonah
Reply all
Reply to author
Forward
0 new messages