Stan for complex survey designs?


Peter Phalen

Feb 25, 2016, 10:45:57 AM
to Stan users mailing list

Hi all!

I've been doing some work with complex survey designs, specifically using Thomas Lumley's survey package for R to run difference-in-differences analyses on the National Health Interview Survey (NHIS). The NHIS uses weighting, nested stratification, and clustering. With the R survey package I can simply pass these parameters to the svydesign function and all the pesky math is done for me. This is great, because my grasp of the equations underlying statistics for complex survey designs is very limited.
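For reference, the survey-package workflow described here looks roughly like the following sketch. The NHIS design variables (psu_p, strat_p, wtfa_sa) are the ones named later in the thread; the data frame name nhis and the variable coding are hypothetical.

```r
library(survey)

# Declare the complex design: clusters (PSUs), nested strata, and weights
des <- svydesign(ids = ~psu_p, strata = ~strat_p, weights = ~wtfa_sa,
                 nest = TRUE, data = nhis)

# Difference-in-differences via the SMI x year interaction,
# with design-based standard errors
fit <- svyglm(coverage ~ SMI * year, design = des, family = quasibinomial())
summary(fit)
```

The coefficient on the SMI:year interaction is the difference-in-differences estimate on the link scale.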

I would love to be able to take a Bayesian approach to my research questions using Stan. In particular, I have been trying to figure out how I would run logistic regressions that take weights, strata, and clusters into account. It would have to involve an interaction term to capture the difference-in-differences effect.

Has anyone tried using Stan with complex survey designs like these? Is there a way to do this that isn't prohibitively complicated?

Thank you very much.

Peter

Andrew Gelman

Feb 25, 2016, 9:34:05 PM
to Peter Phalen, Jonah Gabry, Shira Mitchell, Susanna Maria Makela, Chen, Qixuan, Yajuan Si, Stan users mailing list
Hi Peter.  I’m cc-ing some of my colleagues, and also stan-users, on this.  Survey analysis using Stan is a topic close to my heart.  We have some ideas on this but have not yet put it all together.  We hope to, though!
Andrew

On Feb 25, 2016, at 8:56 AM, Peter Phalen <peter....@gmail.com> wrote:

Hi Andrew,

I've been messing around with Stan for a little while, trying to learn the ropes. I've also been working on a paper analyzing national-level health disparities before and after 'Obamacare' using Thomas Lumley's survey package. It's currently in the revise-and-resubmit stage.

I would love to be able to switch to Bayesian modeling for these complex survey designs. I found your 2007 paper, "Struggles with survey weighting and regression modeling." To be frank, as a beginner in Bayesian thinking, it goes over my head. Does the Stan team have any plans to implement a framework for handling complex survey designs? Or do you have any plans to illustrate how this could be done in a blog post or a publication?

Thank you so much! I'm a daily reader of your blog.

-Peter

--
Peter Phalen, MA
206.371.1107

Doctoral Student in Psychology
University of Indianapolis
pha...@uindy.edu

Research Assistant
Roudebush VA Medical Center
Indiana University Dept. of Psychiatry


Bob Carpenter

Feb 25, 2016, 11:11:04 PM
to stan-...@googlegroups.com, Peter Phalen, Shira Mitchell, Susanna Maria Makela, Chen, Qixuan, Yajuan Si
Is there a particular model you have in mind? The problem
with building general frameworks for complex designs is
that they're complex. And even if we did build a framework,
we'd want to start from a few example models.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Andrew Gelman

Feb 25, 2016, 11:11:55 PM
to stan-...@googlegroups.com, Peter Phalen, Shira Mitchell, Susanna Maria Makela, Chen, Qixuan, Yajuan Si
Good point. Working one example at a time is a good way to go.

Peter Phalen

Feb 25, 2016, 11:42:58 PM
to Stan users mailing list, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
I use somewhat more elaborate models in my paper but a simple scenario that I think covers all the relevant cases I have in mind could be as follows...

Group variables: 
Serious Mental Illness (binary: absent v. present)
Survey year (binary: pre v. post)
Serious Mental Illness:Year (interaction between the two variables, which equals the difference-in-differences, i.e., the change in the disparity)

Outcome:
Health Insurance Status (binary: covered v. not covered)

Complexity: 
Survey has clusters, nested strata, and weights.

Is this the kind of example you were looking for? Thank you in advance for all of your help.

Jonah Gabry

Feb 26, 2016, 12:33:48 AM
to stan-...@googlegroups.com
Most of that shouldn't be too difficult to do in Stan. It gets a bit hazy when survey weighting enters the picture, though. Simply weighting the log-likelihood contributions can be considered non-Bayesian in that there's no generative model. But people still do it. Could you instead include all relevant variables in the model and then do some post-stratification if necessary? I know in many cases that's not feasible.

Jonah
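The (non-generative) weighting Jonah mentions is usually implemented by scaling each observation's log-likelihood contribution by its sampling weight. A minimal sketch in current Stan syntax; the variable names (w, y, x, alpha, beta) are illustrative:

```stan
data {
  int<lower=0> N;
  vector[N] x;                       // a single covariate
  vector<lower=0>[N] w;              // first-order sampling weights
  int<lower=0, upper=1> y[N];        // binary outcome
}
parameters {
  real alpha;
  real beta;
}
model {
  // pseudo-likelihood: each unit's log-likelihood contribution is
  // scaled by its sampling weight (note: not a generative model)
  for (n in 1:N)
    target += w[n] * bernoulli_logit_lpmf(y[n] | alpha + beta * x[n]);
}
```

Because the weighted likelihood is not generative, posterior uncertainty from this model should be interpreted with care.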

Andrew Gelman

Feb 26, 2016, 12:36:44 AM
to stan-...@googlegroups.com
Yes, we want to do the poststrat on the survey weights. We should be able to come up with something simpler and more general than this:
http://www.stat.columbia.edu/~gelman/research/published/Si_et_al-BA14.pdf
A



Jonah Sol Gabry

Feb 26, 2016, 12:40:39 AM
to stan-...@googlegroups.com
I never did fully understand that paper. I should read it again, but something simpler is definitely needed.




Bob Carpenter

Feb 26, 2016, 1:32:06 AM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
We really need to start with a concrete model defined
in terms of a prior and likelihood (really just anything
proportional to the posterior). I'd recommend starting
simple and then building up to something more complex as
you get a feel for Stan.

Maybe we'll get some examples together before Andrew
next teaches his survey sampling class.

- Bob

Peter Phalen

Feb 26, 2016, 7:50:19 AM
to Stan users mailing list, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Gotcha. I've got a busy day, so this may be a while, but I will try to write a .stan file for the model minus the complex survey variables.

terrance savitsky

Feb 26, 2016, 11:27:57 AM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Hello Stanners, I attach two papers that, together, offer a very general approach for incorporating information from the sampling design distribution (to account for an informative sampling design) using the first-order sampling weights. The first paper illustrates the formulation and use of the pseudo-posterior. The second paper provides the frequentist consistency properties of the pseudo-posterior. To Jonah's point, the approach incorporates the first-order sampling weights in a plug-in fashion, so it is not fully Bayesian.

Our observed sample may be viewed as arising from two successive processes: 1. the (unknown) population generating distribution; 2. the (known) sampling design distribution. The observed sample, y, is therefore not able to "comment" on the sampling design distribution. One may view a fully Bayesian approach as co-modeling the first-order sampling weights, either in direct association with y or through a shared-parameter approach. A challenge here is that the distribution for the weights, w, conditioned on y in the observed sample is not the same as that for the population, so one must impute quantities for the non-sampled units. Secondly, the first-order weights don't convey the dependence structure among the sampled units (e.g., in a without-replacement sampling design); one would need higher-order inclusion probabilities.

Using the plug-in approach also doesn't capture the dependence structure, but Pfeffermann and Sverchkov show that the conditional *expectation* of the (first-order) weights given the response under the distribution for the observed/sampled units is equal to that under the distribution for the population. The implication is that one may, under an inefficient sampling design, smooth the weights for the observed sample by regressing them on the response and then use those smoothed weights to form the pseudo-posterior distribution described in the attached papers.

The use of the pseudo-posterior approach has the virtue that it can be used with any population generating model without changing the parameterization. This is especially useful where the sampling design is viewed as a nuisance, or where the analyst may not know the sampling design well enough to parameterize it in the model. Additionally, using the pseudo-posterior allows one to retain the geometry of the MCMC sampler.

Thank you,
Terrance Savitsky
posis_123115_v1.pdf
bis_072215_ims.pdf
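A minimal sketch of the weight-smoothing step Savitsky describes, regressing the observed weights on the response and using the fitted values. The data frame samp and its columns w and y are hypothetical names:

```r
# Pfeffermann-Sverchkov-style smoothing: the conditional expectation of the
# weights given the response is the same in the sample and the population,
# so fitted values from a regression of w on y give smoothed weights
w_smooth <- fitted(lm(w ~ y, data = samp))

# use w_smooth in place of the raw weights when forming the pseudo-posterior
```

A richer regression of w on y (e.g., with transformations of y) could be substituted; the linear fit here is only the simplest case.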

Peter Phalen

Mar 1, 2016, 10:29:49 AM
to Stan users mailing list, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Sorry for the delay. It's been a really busy week. Bob asked for a concrete model that we could build into something beautiful. Here is my attempt at defining a really basic .stan file that runs a linear regression with two covariates and an interaction term between the variables of interest for the sake of determining differences in differences. Note there is no attempt to take weights into account. 

For what it's worth, here are the three survey design variables provided by the NHIS:
cluster: psu_p
strata: strat_p
weight: wtfa_sa

And here is the Stan file. This uses a flat prior, I believe, but please let me know if I should adjust the priors. I have run pretty exhaustive descriptive statistics on this data, so I know, for example, what the average rate of health coverage is for people in general (irrespective of SMI status). Perhaps that information could be used to determine a prior. Again, I apologize for the obvious ignorance about best practices in Bayesian statistics.

Here is my ComplexSurveyDesign.stan. It is not complex yet.

 data {
   // Define variables in data
   // Number of observations (an integer)
   int<lower=0> N;
   // Number of parameters
   int<lower=0> p;

   // Variables:
   
   // outcome
   int coverage[N]; // 0=No health insurance, 1=Covered

   //two covariates 
   int<lower=1, upper=2> sex[N];  // 1 = male, 2 = female

   int<lower=1, upper=4> race[N]; // 1=Hispanic,
                                  // 2=non-Hispanic white,
                                  // 3=non-Hispanic black,
                                  // 4=non-Hispanic Asian, 5=non-Hispanic other
   
   // variables of interest
   int<lower=0, upper=1> SMI[N]; // 0=None, 1=Present
   int<lower=0, upper=1> year[N]; // 0=Pre, 1=Post
 }
 
 parameters {
   // Define parameters to estimate
   real beta[p];
   
   // standard deviation (a positive real number)
   real<lower=0> sigma;
 }
 
 transformed parameters  {
   // Mean
   real mu[N];
   for (i in 1:N) {
     mu[i] <- beta[1] + beta[2]*sex[i] + beta[3]*race[i] + beta[4]*SMI[i] + beta[5]*year[i] + beta[6]*year[i]*SMI[i]; // interaction term should represent the difference-in-differences (i.e., change in the disparity between years)
   }
 }
 
 model {
   // Prior part of Bayesian inference (flat if unspecified)
 
   // Likelihood part of Bayesian inference
     coverage ~ normal(mu, sigma);  
 }


Peter Phalen

Mar 1, 2016, 12:13:03 PM
to Stan users mailing list, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
I'm sorry, I guess I was on autopilot in setting those lower and upper bounds. I believe the beta parameters should have a lower bound of -1 and an upper bound of 1, because these betas correspond to percent changes over time.

Sorry about that.

Also, for what it's worth, I posted an R script that includes code for pulling and prepping the NHIS data into an R readable format. It's the same function I use myself, essentially: https://github.com/peterphalen/Survey-data-analyses-using-R/blob/master/National%20Health%20Interview%20Survey%20(NHIS)/Results%20for%202010-2014%20(unaltered%20autogenerated%20output%20from%20the%20Load%20and%20Analyze%20NHIS%20Data%20script).txt 

Andrew Gelman

Mar 1, 2016, 4:12:41 PM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
In general I don’t think it makes sense to put in such hard constraints unless they’re part of the mathematical definition of the model.  If you think a parameter should be between -1 and 1, maybe better to give it a normal(0,1) prior.
A
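In Stan terms, Andrew's suggestion amounts to replacing the hard bounds with a weakly informative prior. A fragment, using the beta vector from Peter's model:

```stan
parameters {
  vector[6] beta;        // unconstrained: no <lower=-1, upper=1>
}
model {
  beta ~ normal(0, 1);   // soft constraint: concentrates mass near [-1, 1]
                         // without forbidding values outside it
}
```

Hard interval constraints force the sampler to work on a bounded, transformed space and can pile up posterior mass at the boundaries; a soft prior avoids both problems.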

Bob Carpenter

Mar 1, 2016, 6:03:09 PM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
I think you need to do some background reading on basic
Bayesian stats. I'd strongly recommend Gelman and Hill's
book on regression, though it's paywalled.

I have some more specific comments inline.

> On Mar 1, 2016, at 12:13 PM, Peter Phalen <peter....@gmail.com> wrote:

You don't want all these doc comments explaining how the language
works, like "Define variables in data" and "(an integer)" --- those
are all built into the language. I'll edit for you.


> data {
> // Define variables in data
> // Number of observations (an integer)
> int<lower=0> N;
> // Number of parameters
> int<lower=0> p;
>
> // Variables:
>
> // outcome
> int coverage[N]; // 0=No health insurance, 1=Covered
>
> //two covariates
> int<lower=1, upper=2> sex[N]; // 1 = male, 2 = female

usually easier if indicators are 0/1, but this is OK too.

> int<lower=1, upper=4> race[N]; // 1= Hispanic
> // 2= Non-hispanic white,
> // 3=non-hispanic black,
> // 4 = non-hispanic asian, 5=non-hispanic other

your bounds only go to 4

>
> // variables of interest
> int<lower=0, upper=1> SMI[N]; // 0=None, 1=Present
> int<lower=0, upper=1> year[N] // 0 = Pre, 1=Post
> }
>
> parameters {
> // Define parameters to estimate
> real beta[p];
>
> // standard deviation (a positive real number)
> real<lower=0> sigma;
> }
>
> transformed parameters {
> // Mean
> real mu[N];
> for (i in 1:N) {
> mu[i] <- beta[1] + beta[2]*sex[i] + beta[3]*race[i] + beta[4]*SMI[i] + beta[5]*year[i] + beta[6]*year[i]*SMI[i]; //interaction term should represent difference-in-difference (i.e., change in the disparity between years

This can all be vectorized and defined as a local variable
in the model block

But sex and race are broken; you need varying intercepts
based on N - 1 of the races. And you only include the coefficient for
one of the sex values --- the baseline goes into the general intercept.

You can use a one-hot encoding if you really must, but it's very
inefficient.

> }
> }
>
> model {
> // Prior part of Bayesian inference (flat if unspecified)
>
> // Likelihood part of Bayesian inference
> coverage ~ normal(mu, sigma);
> }

You don't want to model a 0/1 outcome as normal. Perhaps a logistic regression?
And then you don't need sigma:

coverage ~ bernoulli_logit(mu);

Good luck.

- Bob
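Pulling together the suggestions in this thread (logistic likelihood, a covariate matrix with proper indicator coding, weakly informative priors), a revised sketch might look like the following. It still ignores the survey design; the matrix X is assumed to come from something like R's model.matrix, with the intercept column dropped:

```stan
data {
  int<lower=0> N;                      // observations
  int<lower=0> p;                      // columns of the design matrix
  matrix[N, p] X;                      // indicators for sex/race, SMI, year,
                                       // and the SMI:year interaction
  int<lower=0, upper=1> coverage[N];   // 0=no health insurance, 1=covered
}
parameters {
  real alpha;                          // intercept (baseline categories)
  vector[p] beta;
}
model {
  alpha ~ normal(0, 5);                // weakly informative priors
  beta ~ normal(0, 1);
  // logistic regression: no sigma needed for a binary outcome
  coverage ~ bernoulli_logit(alpha + X * beta);
}
```

The coefficient on the SMI:year column of X is the difference-in-differences effect on the log-odds scale.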

Peter Phalen

Mar 1, 2016, 6:16:37 PM
to Stan users mailing list, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Yeah I think I misrepresented the level of my question. I am absolutely not your guy if the goal here is to figure out how to analyze complex survey designs using Bayesian methods. I began learning about the actual math behind Bayesian statistics for the first time approximately one (1) week ago. 

I was more wondering whether the analysis of complex survey designs is feasible in Stan in the first place, and what that would look like. I am a long way from coming up with the equations myself!

Bob Carpenter

Mar 1, 2016, 6:27:08 PM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Given that I don't know what you mean by "complex survey
designs", it's hard to answer. I'll defer to Andrew and
everyone he cc-ed.

- Bob

Andrew Gelman

Mar 1, 2016, 6:30:21 PM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Just about any real survey of humans is a complex survey design in practice.

Peter Phalen

Mar 1, 2016, 6:35:39 PM
to Stan users mailing list, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu, gel...@stat.columbia.edu
Touché. By "complex survey designs" I mean survey samples that have clusters, stratification, and/or weights.

Andrew Gelman

Mar 1, 2016, 6:42:15 PM
to Peter Phalen, Stan users mailing list, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Just about any real survey of humans has at least one, typically two or three, of these complications!

Bob Carpenter

Mar 1, 2016, 6:51:07 PM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu, gel...@stat.columbia.edu
Yes, we can do all of those things in Stan.

- Bob

Jonah Sol Gabry

Mar 2, 2016, 12:18:55 AM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu, gel...@stat.columbia.edu
And regarding the constraints on the betas: Bob's suggestion of using the logit model takes care of that for you. The betas in your model will be unconstrained because they will be on the logit (log-odds) scale, so you don't need any constraints.

Essentially you want a multilevel logistic regression model, which you can read about in the Gelman and Hill book Bob suggested. 

Also, you might want to consider passing in your covariates as a matrix created by something like R's model.matrix function, which will create all of the indicator variables and interactions for you.

Hope that helps,

Jonah

P.S. You might get the impression from this thread that we don't really know how to do complex survey models in Stan. That's true, but that's just because they're hard to do well in general. If you're going to do it (and I think you should!) then I do think Stan's the best option.
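For example, Jonah's model.matrix suggestion might look like this. The data frame name nhis is hypothetical, and sex and race need to be coded as factors for the indicator columns to be generated:

```r
# model.matrix builds an intercept column, indicator (dummy) columns for
# each factor level beyond the baseline, and the SMI:year interaction column
X <- model.matrix(~ sex + race + SMI * year, data = nhis)

# bundle the pieces for Stan; p is the number of design-matrix columns
standata <- list(N = nrow(X), p = ncol(X), X = X,
                 coverage = nhis$coverage)
```

This sidesteps the one-hot and numeric-coding pitfalls Bob pointed out, since R drops the baseline level of each factor automatically.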

Jonah Sol Gabry

Mar 2, 2016, 12:20:40 AM
to stan-...@googlegroups.com, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Thanks for sharing the papers, Terrance. I'll definitely give them a look.

Jonah

CoreySparks

Mar 2, 2016, 11:56:10 AM
to Stan users mailing list, peter....@gmail.com, sm2...@columbia.edu, smm...@columbia.edu, qc2...@cumc.columbia.edu, y...@biostat.wisc.edu
Hello all, 
I've been following this thread, and it's very near and dear to my own heart as well. Most of the time, if I may, people analyzing social/health survey data aren't necessarily interested in obtaining predictions from their models, but rather in asking questions of the data and testing hypotheses, so complex survey data pose special problems when using hierarchical or Bayesian models. I've tried to illustrate this with an Rpubs doc:


that shows how, using the CDC's Behavioral Risk Factor Surveillance System data (a national health survey with a stratified design, carried out by phone), analysts who do not fully account for homogeneity within strata get the wrong results (biased betas, wrong standard errors). By adding the sampling weight, we can get unbiased betas, and if we wanted to do prediction, that would probably be enough. But if we want to ask questions, then the standard errors for the betas also have to be correct, or our test statistics will be wrong (generally the t/z tests are "too significant" in this case).

I've tried to coax lmer into doing this by putting the sample design variables (survey weight and stratum identifiers) into the model (strata I treat as a random intercept), but I do not get standard errors for the betas that are in line with what you get by applying Taylor-series linearization of the variance function using the svyglm() function in the survey package, which is generally the agreed-upon method for analyzing survey data (along with using replicate weights, etc., but that depends on the survey under consideration). The SEs are still too low, even after trying to control for within-stratum homogeneity using the random intercepts.

In the Rpubs example above, I illustrate the various methods (no weights/survey information, lmer with weights and strata, full survey methods).

I've attached below the Stan file generated from brms, and would welcome continued discussion of this topic, now with a more concrete example.
Cheers to all,
Corey




fit_glm_stan.txt