analysis of data with complex survey sampling

josef...@gmail.com

Dec 22, 2016, 9:43:26 PM

to pyd...@googlegroups.com

Does anyone use Python, pandas, ... for the analysis of survey data with complex sampling designs (stratified, clustered, ...)?

I would like to make support for statistical methods and inference for complex survey data one of the priorities for statsmodels for the next year. It's one of the topics that every statistical package is supposed to cover, like the survey package in R, the svy prefix in Stata, or the corresponding methods in SAS.

I would think that pandas, or pandas-like packages for out-of-core data handling, would be appropriate for most of the data handling and aggregation.

The two main features that we will need to add to statsmodels, AFAICS, are support for probability weighting across models and across statistical functions, and standard errors that are corrected for various complex sampling designs.
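To make the two pieces concrete, here is a minimal sketch of what they mean for the simplest case, a weighted mean. It is not existing statsmodels API: the function name `weighted_mean_with_cluster_se` and its arguments are made up for illustration, and it assumes a single stratum, clusters sampled with replacement, and no finite population correction. The standard error uses Taylor linearization with sums of the linearized values per cluster.

```python
import numpy as np


def weighted_mean_with_cluster_se(y, w, cluster):
    """Probability-weighted mean and a design-based standard error.

    Hypothetical helper for illustration only. Assumes one stratum,
    with-replacement sampling of clusters, no finite population correction.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    cluster = np.asarray(cluster)

    w_sum = w.sum()
    mean = (w * y).sum() / w_sum  # ratio estimator of the population mean

    # Linearized contribution of each observation to the ratio estimator.
    z = w * (y - mean) / w_sum

    # Sum the contributions within each cluster (primary sampling unit).
    _, idx = np.unique(cluster, return_inverse=True)
    z_c = np.bincount(idx.ravel(), weights=z)

    # Between-cluster variance of the cluster totals.
    n_c = z_c.size
    var = n_c / (n_c - 1) * np.sum((z_c - z_c.mean()) ** 2)
    return mean, np.sqrt(var)
```

With equal weights and every observation in its own cluster, this reduces to the usual standard error of the mean, which is a quick sanity check. Stratification would repeat the variance step within each stratum and add up the results.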

Is there already any work in this area?
My search with google and on github came up essentially negative.

Also, if there is not much demand for it, I will put it on the back burner. However, some generic things like probability weights are high priority independently of "survey", because they are also needed for other methods.

At the current stage, I would just like to make plans and figure out how difficult this will be.

Aside: From this master's thesis, it looks like pandas is doing pretty well in terms of performance in the case (s)he looked at:
https://cs.brown.edu/research/pubs/theses/masters/2016/shao.qiming.pdf

(I didn't read it, I just looked at the pandas versus spark graphs.)

Josef

John E

Dec 25, 2016, 1:30:38 AM

to PyData

I mostly work with economics data sets, and all of them are weighted. Just making weighted tabulations easier would be great. E.g. Stata's tabstat (with weights and the by option) is a great tool for doing all sorts of weighted tables.
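For reference, a weighted table by group can already be put together in plain pandas, roughly like the sketch below. The helper name `weighted_table` and the column names in the usage note are invented for illustration, and it assumes there are no missing values in the value or weight columns. It only gives point estimates, not standard errors.

```python
import pandas as pd


def weighted_table(df, value, weight, by):
    """Count, sum of weights and weighted mean of `value` within each `by` group.

    Hypothetical helper for illustration; assumes no missing values.
    """
    # Keep only the needed columns and add the weighted values.
    tmp = df[[by, value, weight]].assign(_wy=df[value] * df[weight])
    g = tmp.groupby(by)
    return pd.DataFrame(
        {
            "n": g.size(),                               # observations per group
            "sum_w": g[weight].sum(),                    # sum of weights per group
            "wmean": g["_wy"].sum() / g[weight].sum(),   # weighted mean per group
        }
    )
```

With a frame that has, say, `region`, `income` and `wt` columns, the call would be `weighted_table(df, "income", "wt", by="region")`, and the result is an ordinary DataFrame indexed by region.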
