analysis of data with complex survey sampling


josef...@gmail.com

Dec 22, 2016, 10:06:56 PM
to pystatsmodels
<This is mostly a duplicate of what I just sent to the pydata mailing list.>

a bit of background (not on pydata):

Handling of complex survey analysis is a basic requirement for statistics/econometrics packages.

I have been looking at it on and off for years, after some early questions. There were not many questions about it on the mailing list.
I think I now have enough entry points into the topic to start to understand what's going on. Also, it's a bit difficult to see what can be handled by pandas and similar packages, and what we need to add on the statsmodels side to support the statistical and inference backend.
(I will not be sure what's going on until we have unit tests against, for example, Stata. "We don't have to understand the details if we can replicate what the main other statistics packages are doing.")


-----------
Does anyone use python, pandas, ... for the analysis of survey data with complex sampling, like stratified, clustered, ...?


I would like to make support for statistical methods and inference for complex survey data one of the priorities for statsmodels for the next year. It's one of the topics that every statistical package is supposed to have, like survey in R, the svy prefix in Stata, or the corresponding methods in SAS.

I would think that pandas, or pandas-like packages for out-of-core data handling, would be appropriate for most of the data handling and aggregation.

The two main features that we will need to add to statsmodels, AFAICS, are support for probability weighting across models and across statistical functions, and standard errors that are corrected for various complex sampling designs.
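
As a concrete illustration of both pieces, here is a minimal sketch (not an existing statsmodels or pandas API) of a probability-weighted mean with a Taylor-linearization standard error for a stratified, clustered design; the column names 'y', 'weight', 'strata', and 'psu' are hypothetical placeholders.

    import numpy as np
    import pandas as pd

    def svy_mean(df, y="y", weight="weight", strata="strata", psu="psu"):
        """Weighted mean and linearized SE under a stratified, clustered,
        with-replacement design approximation (illustrative only)."""
        w = df[weight].to_numpy(float)
        yv = df[y].to_numpy(float)
        wsum = w.sum()
        mean = (w * yv).sum() / wsum
        # linearized contributions of the ratio (mean) estimator
        z = w * (yv - mean) / wsum
        var = 0.0
        for _, s in df.assign(_z=z).groupby(strata):
            zt = s.groupby(psu)["_z"].sum()     # PSU totals within the stratum
            n_h = len(zt)
            if n_h > 1:
                var += n_h / (n_h - 1) * ((zt - zt.mean()) ** 2).sum()
        return mean, np.sqrt(var)

    # usage: est, se = svy_mean(survey_df)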

Is there already any work in this area?
My search with google and on github came up essentially negative.

Also, if there is not much demand for it, I will put it on the back burner; however, some generic things like probability weights are a high priority independently of "survey" support because they are also needed for other methods.
At the current stage, I would just like to make plans and figure out how difficult this will be.

Aside: From this master's thesis it looks like pandas is doing pretty well in terms of performance in the cases the author looked at:
https://cs.brown.edu/research/pubs/theses/masters/2016/shao.qiming.pdf
(I didn't read it, I just looked at the pandas versus spark graphs.)

Josef


Stephen Childs

Dec 23, 2016, 11:17:33 AM
to pystat...@googlegroups.com
I haven't used pandas or statsmodels for this, but I did some work with complex survey data (from Statistics Canada) in the past using Stata. I haven't had much call for it recently, but I will be starting a new position in the new year that is all about survey data and so it might be a good opportunity to get into this again.

I think if statsmodels is to replicate the functionality of Stata and other stats packages, it does need this support -- particularly probability weights.

So, I'd say there is some interest.

Stephen Childs

josef...@gmail.com

Dec 23, 2016, 12:07:27 PM
to pystatsmodels
On Fri, Dec 23, 2016 at 10:55 AM, Stephen Childs <sech...@gmail.com> wrote:
> I haven't used pandas or statsmodels for this, but I did some work with complex survey data (from Statistics Canada) in the past using Stata. I haven't had much call for it recently, but I will be starting a new position in the new year that is all about survey data and so it might be a good opportunity to get into this again.
>
> I think if statsmodels is to replicate the functionality of Stata and other stats packages, it does need this support -- particularly probability weights.
>
> So, I'd say there is some interest.

Thanks for the reply.

Weights will be coming as part of the generic support. It took me a long time this year to figure out which kinds of weights we need and want: https://github.com/statsmodels/statsmodels/issues/2848
I preliminarily ended up with 4 types of weights, but we will be able to treat some of them identically as generic weights in the internal core calculations (`iweights`?).

I still have to figure out how to get the weights into all the robust sandwich covariance estimators, but this will also be part of the generic support.

However, I'm not sure what other survey-specific support we need beyond those generic additions to our models. I still think that, for example, Stata's svy prefix command is a bit "magic", and I'm still not sure what exactly it is doing. Some of it sounds mainly like choosing the right options in the "generic" models.

Any feedback and help for the survey-specific parts would be great (especially given that I have never worked with those).

Josef

geo...@survata.com

Dec 23, 2016, 8:37:56 PM
to pystatsmodels
I would give a big +1 to fleshing out the survey support in statsmodels. 

(In case anyone hasn't been reading my other recent thread) I am a data engineer at Survata, a startup that conducts online surveys. Our clients are mostly market researchers--they design surveys on our site and we then find respondents for them by syndicating the survey out over a network of online publishers. (Respondents take a survey to unlock some content like an ebook or extra lives in a mobile game).

I joined earlier this year and have been building out tools to allow clients to analyze their survey results online. We're using Python for all the statistics--pandas/scipy/statsmodels, wrapped in a "microservice" using the Falcon web framework (which is really fast and has worked out great for making statistical Python code accessible via an API).

I didn't have any experience analyzing survey data before starting this job so I've been learning how to do it as I go. But as I've learned more it's become pretty clear that the R ecosystem has better tools for analyzing surveys than Python does at this point (mainly the "survey" library, but also ad-hoc libraries like MRCV that handle special cases like multiple-response question types.) 

It would be great to see statsmodels draw closer to parity for survey data. I'm currently working on a Python port of MRCV to contribute to statsmodels and may be interested in starting to translate pieces of the R survey package.

Survata isn't actually using complex sampling designs at this point (because most of our clients consider our publisher network to be close enough to simple random sampling). But I would like to improve the rigour of our sampling and analysis as much as I can going forward and having good tools in statsmodels would make that an easier road.

For a starting reference, the author of the R survey library actually has a textbook about survey data analysis using his library. It does a pretty good job of explaining the theory behind complex sampling / clustering / stratification / design effects and explaining why the library works the way it does.

-George

josef...@gmail.com

Dec 24, 2016, 2:39:26 PM
to pystatsmodels


On Fri, Dec 23, 2016 at 8:37 PM, <geo...@survata.com> wrote:
>
> I would give a big +1 to fleshing out the survey support in statsmodels.
>
> (In case anyone hasn't been reading my other recent thread) I am a data engineer at Survata, a startup that conducts online surveys. Our clients are mostly market researchers--they design surveys on our site and we then find respondents for them by syndicating the survey out over a network of online publishers. (Respondents take a survey to unlock some content like an ebook or extra lives in a mobile game).
>
> I joined earlier this year and have been building out tools to allow clients to analyze their survey results online. We're using Python for all the statistics--pandas/scipy/statsmodels, wrapped in a "microservice" using the Falcon web framework (which is really fast and has worked out great for making statistical Python code accessible via an API).
>
> I didn't have any experience analyzing survey data before starting this job so I've been learning how to do it as I go. But as I've learned more it's become pretty clear that the R ecosystem has better tools for analyzing surveys than Python does at this point (mainly the "survey" library, but also ad-hoc libraries like MRCV that handle special cases like multiple-response question types.)
>
> It would be great to see statsmodels draw closer to parity for survey data. I'm currently working on a Python port of MRCV to contribute to statsmodels and may be interested in starting to translate pieces of the R survey package.


The usual license restrictions apply: the survey package is GPL.
Even if it were possible, it wouldn't be useful for some of the core parts, because those will have to be integrated into our models and our framework (e.g. probability weights and covariance sandwiches).
Based on the list of dependencies of the R survey package, it doesn't look like it uses the sandwich package.

 
>
>
> Survata isn't actually using complex sampling designs at this point (because most of our clients consider our publisher network to be close enough to simple random sampling). But I would like to improve the rigour of our sampling and analysis as much as I can going forward and having good tools in statsmodels would make that an easier road.
>
> For a starting reference, the author of the R survey library actually has a textbook about survey data analysis using his library. It does a pretty good job of explaining the theory behind complex sampling / clustering / stratification / design effects and explaining why the library works the way it does.


I spot-checked Lumley's book, and like most of those books it looks to be written for applied users and does not have enough "math" for implementing it. E.g., in the appendix on analytical details he mainly refers to Binder's 1983 article for the sandwich/linearized covariance, which looks like it goes back to Fuller's articles from the 1970s.
(Binder is not very explicit about the estimation of the covariance of the moment conditions, the center of the sandwich, but it looks like a standard Huber M-estimator sandwich, which we already have available as cov_type="HC0", and it predates the more general development of sandwiches after White 1980. That's a bit of the history of Eicker–Huber–White; some add Godambe, and some econometrician I don't remember already mentioned it in the '40s or '50s. AFAIK, the actual estimation theory picked up only with White.)
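
As a point of reference for what is already available, here is a small sketch (with simulated, purely illustrative data) of the HC0 and cluster-robust sandwich covariances in statsmodels:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = X @ [1.0, 0.5, -0.5] + rng.normal(size=200)
    groups = rng.integers(0, 10, size=200)   # a hypothetical cluster/PSU identifier

    # the basic Eicker-Huber-White sandwich ("HC0" mentioned above)
    res_hc0 = sm.OLS(y, X).fit(cov_type="HC0")
    # cluster-robust sandwich, clustering on the design variable
    res_cl = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})
    print(res_hc0.bse)
    print(res_cl.bse)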

Another reference specific to categorical data in complex surveys
Mukhopadhyay, Parimal. 2016. Complex Surveys: Analysis of Categorical Data. Singapore: Springer Singapore. http://link.springer.com/10.1007/978-981-10-0871-9.

Based on a 5-minute look it has a lot of math, and looks useful for contingency-table and loglinear types of analysis.

Josef

John E

Dec 25, 2016, 1:23:53 PM
to pystatsmodels
FWIW, I'm coming at this as a user of datasets in economics that are almost always weighted. I know that the machinery behind the scenes that generates these weights is quite complex, but I essentially don't care because I am just using them as frequency weights (a minor Stata annoyance, btw, is not being allowed to interpret non-integer weights as frequency weights). I think this is pretty common for many datasets, but it's hard to say.

For example, I recently worked with the Survey of Consumer Finances, and mostly did it in Stata due to the ease of tabstat (which allows weights) and the lack of something comparable AFAIK in pandas or statsmodels. My guess (just a guess) is that there are lots of people doing tabs like that as opposed to fancier econometrics. So from a potential-number-of-users point of view, mere replication of Stata's tabstat would be very popular, I would guess.

In summary, my experience suggests a huge benefit from just being able to make nice generic weighted tables (i.e. tabstat), which I don't think is (easily) possible in pandas now and which I think would be pretty easy to implement. The fancier econometric pieces would be great, but I would hate for that to slow down development of more basic handling of weights in tables, graphs, etc.
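
As a rough illustration of such a table, a hedged sketch of a tabstat-like weighted summary using groupby/apply; the column names 'income', 'wgt', and 'region' are hypothetical, and this is not an existing pandas or statsmodels function.

    import numpy as np
    import pandas as pd

    def weighted_stats(g, value="income", weight="wgt"):
        w = g[weight].to_numpy(float)
        x = g[value].to_numpy(float)
        mean = np.average(x, weights=w)
        var = np.average((x - mean) ** 2, weights=w)   # no dof correction
        return pd.Series({"mean": mean, "sd": np.sqrt(var), "sum_w": w.sum()})

    # roughly Stata's: tabstat income [aw=wgt], by(region) stat(mean sd)
    # table = df.groupby("region").apply(weighted_stats)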

josef...@gmail.com

Dec 25, 2016, 4:27:54 PM
to pystatsmodels
On Sun, Dec 25, 2016 at 1:23 PM, John E <eil...@gmail.com> wrote:
> FWIW, I'm coming at this as a user of datasets in economics that are almost always weighted. I know that the machinery behind the scenes that generates these weights is quite complex, but I essentially don't care because I am just using them as frequency weights (a minor Stata annoyance, btw, is not being allowed to interpret non-integer weights as frequency weights). I think this is pretty common for many datasets, but it's hard to say.
>
> For example, I recently worked with the Survey of Consumer Finances, and mostly did it in Stata due to the ease of tabstat (which allows weights) and the lack of something comparable AFAIK in pandas or statsmodels. My guess (just a guess) is that there are lots of people doing tabs like that as opposed to fancier econometrics. So from a potential-number-of-users point of view, mere replication of Stata's tabstat would be very popular, I would guess.
>
> In summary, my experience suggests a huge benefit from just being able to make nice generic weighted tables (i.e. tabstat), which I don't think is (easily) possible in pandas now and which I think would be pretty easy to implement. The fancier econometric pieces would be great, but I would hate for that to slow down development of more basic handling of weights in tables, graphs, etc.


One reason I asked this question is that I think this should be a very popular use case for pandas, xarray, and similar packages. However, I'm not very familiar with using pandas for data handling beyond the basics, and I was hoping that someone is working in this area.
Personally, I'm more specialized and interested in figuring out and adding the required inferential statistics and new sandwiches than figuring out how to do the relevant groupby or pivot table computations efficiently in pandas.
I am hoping that different developers can contribute to the parts where they have a comparative advantage, so the different parts don't have to slow each other down.

I'm planning on more kinds of weights in descriptive statistics, but that will most likely not be as convenient as having full pandas support for weights, and it will be slower than if pandas had weights in its Cython backend.
(When I wrote the one-sample and two-sample statistics in statsmodels.stats.weightstats, frequency weights were the only weights that I "understood", and therefore implemented. The other weights for this and similar functions will hopefully follow next year.)

Details on freq_weights:
We don't impose that they are integers, and they can be misused for other weights, as long as nobs/total weights and degrees of freedom are properly adjusted.
However, I read that in some difficult models that don't support weights, either SAS or SPSS (I don't remember which) uses replication to create a fully expanded sample that can be used with the standard unweighted methods. In those cases an integer restriction on freq_weights will be required.
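
To illustrate the replication idea, here is a small sketch (simulated, illustrative data) showing that integer freq_weights in GLM should give the same fit as the fully expanded sample:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(50, 1)))
    y = X @ [1.0, 2.0] + rng.normal(size=50)
    fw = rng.integers(1, 4, size=50)          # integer frequency weights

    res_w = sm.GLM(y, X, freq_weights=fw).fit()
    # replication: repeat each observation fw[i] times and fit unweighted
    res_rep = sm.GLM(np.repeat(y, fw), np.repeat(X, fw, axis=0)).fit()
    print(res_w.params, res_rep.params)       # point estimates should agree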

Josef

Jeff Reback

Dec 25, 2016, 4:39:41 PM
to pystat...@googlegroups.com
Josef

can you show a (hacked up) example (using numpy or python is fine) of what you would like here?

iow input frame -> output frame and the transformations you want

josef...@gmail.com

Dec 25, 2016, 5:13:22 PM
to pystatsmodels
On Sun, Dec 25, 2016 at 4:39 PM, Jeff Reback <jeffr...@gmail.com> wrote:
> Josef
>
> can you show a (hacked up) example (using numpy or python is fine) of what you would like here?
>
> iow input frame -> output frame and the transformations you want

I'm not really that far yet; John most likely knows better.

examples would be a weighted mean(weights=...), and similar for grouped weighted means
g_mean1 = df.groupby(...).mean(weights=...)
grouper = df.groupby(...)
g_mean2 = grouper.sum(weights * df) / grouper.sum(weights)

assert g_mean1 == g_mean2

(pandas usage is most likely not correct)

most likely an alternative will be to use a user/statsmodels defined mean_weighted  function
and grouper.apply(mean_weighted, df)
which I assume will do an explicit python loop

and similar for var and similar pandas built-in functions for descriptive statistics.
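
For reference, a working version of that pattern in current pandas (toy, illustrative data), checking that the apply-based weighted mean equals sum(w*y)/sum(w) per group:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"],
                       "y":   [1.0, 2.0, 3.0, 4.0, 5.0],
                       "w":   [1.0, 3.0, 1.0, 1.0, 2.0]})

    def mean_weighted(g):
        return np.average(g["y"], weights=g["w"])

    g_mean1 = df.groupby("key").apply(mean_weighted)          # explicit Python loop
    g_mean2 = ((df["y"] * df["w"]).groupby(df["key"]).sum()
               / df["w"].groupby(df["key"]).sum())            # vectorized
    np.testing.assert_allclose(g_mean1.to_numpy(), g_mean2.to_numpy())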

That's roughly what I understand, and it is similar to what the frequency-weighted statistics are doing.
There are some tricky details like degrees-of-freedom corrections. NumPy had some discussion about weights in the covariance calculation, where I get confused each time I look at the degrees-of-freedom correction.

So far I have mostly been skimming documentation and articles and didn't pay much attention to details like who does which degrees-of-freedom correction or other adjustment where.

Josef

Jeff Reback

Dec 25, 2016, 5:21:35 PM
to pystat...@googlegroups.com

josef...@gmail.com

Dec 25, 2016, 5:59:50 PM
to pystatsmodels
On Sun, Dec 25, 2016 at 5:21 PM, Jeff Reback <jeffr...@gmail.com> wrote:

Thanks for the link.
I think that pretty much summarizes it and the associated problems. I don't know much about the pandas pattern for this, but once this is available it would be good to trickle it up to pivot tables, cross-tabs, or whatever other features pandas offers in this area, to make life easier.

Essentially, the issue of which types of weights to support, and which computations depend on them, is what I (or we) have to figure out next year. When thequackdaddy and I added freq_weights to GLM, I still ran into some issues in the auxiliary results (the interaction with exposure and how to adjust the sandwich robust covariance calculations). This part is still experimental and not completely clear yet.
However, compared to before, I now have a lot more use cases where we need the weights, and the plan is to fully support them in the not too distant future, but it will take time until they are fully supported in all parts of the models.

My guess (but I haven't tried to implement it) is that in the core computation we can get away with one, or at most two, kinds of weights, and we can do the other adjustments in the high-level part that just adjusts the aggregate values.

(In terms of performance, I thought of switching to `if weights.size == 1 and weights == 1: <don't use weights>` to avoid multiplication with an array of ones each time, as we do in some places right now.)

Renzo Massari

Sep 13, 2017, 12:14:58 AM
to pystatsmodels
Hi! New guy here. @josefpktd, thanks for looking into this! It is extremely important.
I'm an applied microeconometrician and have been working on survey analysis for development organizations (esp. the World Bank) for years. Though I'm trying to convert the research teams to Python... there's always the complex survey design issue
that has everyone in the community coming back to Stata (yes, they could use R, but Stata svy is so well integrated).
For survey design, we need *probability weights*, not the other types. Those other types are useful, but not what's relevant for survey analysis. I'm afraid that the GitHub conversation linked in this thread is about frequency weights.
The real difference kicks in with the variance/covariance matrix only! "Analytical" weights and "probability" weights, for example, lead to the same estimated beta parameters in an OLS model, but the SEs are different.
Other than weights, we'd need the error structure for stratification and for clustering.
A while ago (July 2016), I managed to match some results using statsmodels to Stata:

For adding weights:
  # multiply each variable by the square root of the probability weights
  func = lambda x: np.asarray(x) * np.asarray(np.sqrt(weights))
And I applied that to all the variables, then estimated an OLS model using the weighted y and X's.
That gives me beta parameters that match Stata's regress y X [pw = Weights]

To match SEs: this matches the Stata results for regress y X [pw = Weights]
  import statsmodels.stats.sandwich_covariance as robust
  # standard errors are the square roots of the diagonal of the robust covariance
  np.sqrt(np.diag(robust.cov_white_simple(resultsW)))
(resultsW = model.fit(), where model is the weighted model)

To match clustered SEs: this matches Stata's results for regress y X [pw = Weights], vce(cluster Weights)
  np.sqrt(np.diag(robust.cov_cluster(resultsW, weights)))

Please let me know if that helps.
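
Putting that recipe together, here is a self-contained hedged sketch with simulated data; the names y, X, weights, and cluster_id are placeholders, and the standard errors are read off the diagonal of the sandwich covariances:

    import numpy as np
    import statsmodels.api as sm
    import statsmodels.stats.sandwich_covariance as robust

    rng = np.random.default_rng(0)
    n = 500
    X = sm.add_constant(rng.normal(size=(n, 2)))
    y = X @ [1.0, 2.0, -1.0] + rng.normal(size=n)
    weights = rng.uniform(0.5, 2.0, size=n)        # probability weights
    cluster_id = rng.integers(0, 20, size=n)       # hypothetical cluster variable

    # weight y and X by sqrt(weights); the betas correspond to
    # Stata's regress y X [pw = weights]
    w = np.sqrt(weights)
    resultsW = sm.OLS(y * w, X * w[:, None]).fit()

    se_robust = np.sqrt(np.diag(robust.cov_white_simple(resultsW)))
    se_cluster = np.sqrt(np.diag(robust.cov_cluster(resultsW, cluster_id)))
    print(resultsW.params)
    print(se_robust, se_cluster)
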
I have not coded for statsmodels or any other packages before (just scripting data analysis), but was thinking I'd start ... then found this thread!
Has there been any progress? Otherwise, if you have some patience, I'd love to help. I'm fluent in Python and R, but from the statistics and data science side, and I'd love to learn to develop the package by helping to add survey functionality
(as a reference: I started studying complex survey analysis in my PhD, which I finished a decade ago, and have worked on that for most of my professional career since, mostly at the research group at the World Bank).

Please let me know!

josef...@gmail.com

Sep 13, 2017, 12:40:18 AM
to pystatsmodels
Hi Renzo,

It would be great if you could work on and contribute to this area. My background in
survey data is mainly based on one month of readings last year.
Getting more reviewers and contributors who are familiar with the topic would
be a big help.

We just had a successful GSoC project that implemented large parts of the basic
methods and tied them in with some models.

It is not yet tied in with the sandwich cov_types, and there is still a lot of work to do
to come close to Stata in coverage. Also, our adding of weights is going slower than I
hoped for: we have frequency weights in GLM, and var_weights for GLM are almost
ready to merge.
The weights in WLS are generic for the basic results (params, cov_params) but are 
interpreted as var_weights (aweights) for the extra results (llf, ...).
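
A small illustration of that WLS behaviour (simulated, illustrative data; the weights are inverse variances, i.e. Stata-style aweights):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(100, 1)))
    sigma2 = rng.uniform(0.5, 2.0, size=100)
    y = X @ [1.0, 2.0] + rng.normal(size=100) * np.sqrt(sigma2)

    res = sm.WLS(y, X, weights=1.0 / sigma2).fit()
    # params and cov_params are the generic WLS results; llf uses the
    # var_weights (aweights) interpretation of the weights
    print(res.params, res.bse, res.llf)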

Josef

Renzo Massari

Sep 14, 2017, 11:39:22 AM
to pystat...@googlegroups.com
Hi Josef,
Thanks! I'm brushing up on the statsmodels developer page to get some bearings about how to proceed, then I'll try to jump in. Obviously, any guidance will be welcome!
Cheers,
R

------------------------------------------------------
"There are no routine statistical questions, only questionable statistical routines." David Cox

josef...@gmail.com

Sep 14, 2017, 12:13:39 PM
to pystatsmodels
On Thu, Sep 14, 2017 at 11:39 AM, Renzo Massari <renzo....@gmail.com> wrote:
> Hi Josef,
> Thanks! I'm brushing up on the statsmodels developer page to get some bearings about how to proceed, then I'll try to jump in. Obviously, any guidance will be welcome!

The first task is to set up a development environment to check out and compile a statsmodels branch, which depends on the operating system. I usually do development with one of my Python versions running an in-place build of statsmodels master or a branch. (A C compiler and a recent Cython version are required for building from a GitHub checkout. Rebuilding is only necessary when the Cython files change, which currently doesn't happen very often.)

The survey work is currently at the stage of reviewing a large PR.
I described a bit what I am usually doing in the recent thread for "Generalized mixed-effects linear model".

If anything is unclear, then just ask, and I will try to answer, or one of us will answer in the PR for specifics.

Cheers and thanks,

Josef

Renzo Massari

Oct 17, 2017, 6:54:28 PM
to pystat...@googlegroups.com
Hi Josef!
I finally have some time to work on this for a while. Have you merged this branch into the statsmodels repository? I forked statsmodels, but the branch is not there yet (or I can't find it?).
Thanks,
Renzo

------------------------------------------------------
"There are no routine statistical questions, only questionable statistical routines." David Cox

josef...@gmail.com

Oct 17, 2017, 7:43:06 PM
to pystatsmodels
Hi Renzo,

It is not merged. We will merge it after 0.9 is created and the survey branch has been more completely reviewed.

You can check out the branch directly; GitHub provides the following instructions:

git checkout -b jarvmiller-jarvisM_branch master
git pull https://github.com/jarvmiller/statsmodels.git jarvisM_branch

where jarvmiller-jarvisM_branch can be any name for the local branch or checkout.

If or when you have changes, then you can push to your github account 
and, for example, make pull requests against Jarvis' branch.

Either Jarvis, Kerby or I will merge different branches and rebase as needed. 
I guess it will still be a few more weeks before I will look at the details,
but I expect that we can merge it in a month or two.

Thanks,

Josef

Renzo Massari

Oct 18, 2017, 12:49:00 AM
to pystat...@googlegroups.com
Excellent! Sorry for my ignorance *blush*

------------------------------------------------------
"There are no routine statistical questions, only questionable statistical routines." David Cox

Stas Kolenikov

Oct 13, 2021, 1:22:50 PM
to pystatsmodels
What's the current status of this? I am a Python newbie but a survey expert -- I know the math, the methods, and the pitfalls; I have taught this to professional audiences in R and in Stata, and I wrote some Stata packages for the creation of complex survey weights and for variance estimation.

(To all the econometricians here: the chapter on complex survey data analysis in Wooldridge's black book opens with three incorrect definitions of stratified samples and whatnot; and the clustered standard errors are determined by the sampling design rather than by staring at the wall and deciding whether to cluster by state or by county. Sorry, guys... the relatively few folks in the survey world don't have the bandwidth to deal with 10K economists and 50K medical doctors who can't seem to find the right resources. It took me several years to slowly shift my mindset from "the world is a regression/GMM model" to "the world is a finite population with fixed y and random indicators of being in the sample". AMA.)

josef...@gmail.com

Oct 13, 2021, 1:59:12 PM
to pystatsmodels
On Wed, Oct 13, 2021 at 1:22 PM Stas Kolenikov <skol...@gmail.com> wrote:
> What's the current status of this? I am a Python newbie but a survey expert -- I know the math, the methods, and the pitfalls; I have taught this to professional audiences in R and in Stata, and I wrote some Stata packages for the creation of complex survey weights and for variance estimation.

Unfortunately nothing has progressed in this area.

If you are interested and have examples in R or Stata that we should replicate, then I can work as a Python "programmer" for this.
But I don't have time to figure anything out in this area myself. I always got confused about design-based versus model-based approaches, and I never got a feeling for what we should implement.

A brief google and github search finds this in current development: https://samplics.readthedocs.io/en/latest/
Looks like it's focused on population parameter estimation and seems to have no linear or nonlinear regression models.
It also looks like the only one I have seen in the past: https://github.com/statsmodels/statsmodels/issues/6400

And here's my latest issue about what we should add to models to support survey data

Josef

 