Does anyone use Python, pandas, ... for the analysis of survey data with complex sampling designs (stratified, clustered, ...)?
I would like to make support for statistical methods and inference for complex survey data one of the priorities for statsmodels for the next year. It's one of the topics that every statistical package is supposed to cover, like the survey package in R, the svy prefix in Stata, or the corresponding methods in SAS.
I would think that pandas, or pandas-like packages for out-of-core data handling, would be appropriate for most of the data handling and aggregation.
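As a rough illustration of the kind of aggregation I have in mind, here is a minimal sketch of a weighted mean per stratum in plain pandas. The column names ("stratum", "y", "weight") and the helper weighted_mean_by are made up for the example; nothing like this exists in statsmodels yet.

```python
import pandas as pd

# Hypothetical toy survey data; the column names are assumptions.
df = pd.DataFrame({
    "stratum": ["a", "a", "b", "b"],
    "y": [1.0, 3.0, 10.0, 20.0],
    "weight": [1.0, 3.0, 2.0, 2.0],
})

def weighted_mean_by(df, by, value, weight):
    """Weighted mean of `value` within each group of `by` (illustrative helper)."""
    # Sum of weight * value per group, divided by the sum of weights per group.
    weighted_sum = (df[value] * df[weight]).groupby(df[by]).sum()
    weight_sum = df[weight].groupby(df[by]).sum()
    return weighted_sum / weight_sum

means = weighted_mean_by(df, "stratum", "y", "weight")
print(means)
```

This only covers point estimates; the standard errors are the harder part.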
The two main features that we will need to add to statsmodels, AFAICS, are support for probability weights across models and statistical functions, and standard errors that are corrected for the various complex sampling designs.
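To make the two pieces concrete, here is a minimal sketch of a probability-weighted mean with a design-based standard error, assuming a stratified cluster sample, with-replacement sampling of clusters within strata (no finite population correction), and Taylor linearization of the ratio estimator. The function name svy_mean and its signature are hypothetical, not an existing statsmodels API.

```python
import numpy as np
import pandas as pd

def svy_mean(y, weights, strata, clusters):
    """Weighted mean and linearized standard error (illustrative sketch).

    Assumes clusters (PSUs) are sampled with replacement within strata
    and ignores any finite population correction.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    w_total = w.sum()
    ybar = np.sum(w * y) / w_total

    # Linearized (influence) values of the ratio estimator sum(w*y) / sum(w).
    z = w * (y - ybar) / w_total

    frame = pd.DataFrame({
        "z": z,
        "stratum": np.asarray(strata),
        "cluster": np.asarray(clusters),
    })
    # Total of the linearized values within each cluster of each stratum.
    psu_totals = frame.groupby(["stratum", "cluster"])["z"].sum()

    var = 0.0
    for _, totals in psu_totals.groupby(level="stratum"):
        t = totals.to_numpy()
        n_h = len(t)
        if n_h < 2:
            raise ValueError("each stratum needs at least two clusters")
        # Between-cluster variation within the stratum.
        var += n_h / (n_h - 1) * np.sum((t - t.mean()) ** 2)
    return ybar, np.sqrt(var)

# Toy data: two strata with two clusters each (values are made up).
est, se = svy_mean(
    y=[1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 14.0, 16.0],
    weights=[2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0],
    strata=[0, 0, 0, 0, 1, 1, 1, 1],
    clusters=[0, 0, 1, 1, 0, 0, 1, 1],
)
```

With equal weights, a single stratum, and every observation in its own cluster, this reduces to the usual standard error of the mean, which is a convenient sanity check.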
Is there already any work in this area?
My searches on Google and on GitHub came up essentially negative.
Also, if there is not much demand for it, I will put it on the back burner. However, some generic things like probability weights are high priority independently of "survey" because they are also needed for other methods.
At the current stage, I would just like to make plans and figure out how difficult this will be.
(I didn't read it, I just looked at the pandas versus spark graphs.)
Josef