Handling survey weights in OLS.

Tiarnán de Burca

Jun 14, 2017, 4:26:32 PM

to pystatsmodels

Hi,

Forgive me if this is a silly question; I'm new to this type of work.

I have a collection of survey data from the Eurobarometer polls taken across Europe.

In the following code, y is a pandas Series holding the dependent variable and X is a pandas DataFrame of dummy variables, each reflecting a property of the respondent.

(For example, if a respondent supports the Prime Minister's party, that variable would be coded as 1 for that person.)

I'm currently doing an Ordinary Least Squares regression in the form:

import statsmodels.api as sm

model = sm.OLS(y, X)

results = model.fit(cov_type='HC0')

print(results.summary())

I would like to move to doing a Weighted Least Squares in the form:

wls_model = sm.WLS(y, X, weights=weights)

res2 = wls_model.fit(cov_type='HC0')

print(res2.summary())

However, I'm not sure how to present the 'weights' variable to the WLS function.

Currently, if a respondent is from Ireland the weight would be '.2', if they're from Germany '4.8', and so on for the various countries.

What's the 'correct' form for an entry in the weights list?

In case it helps, the codebook describes my weights variable as:

"THIS NUMBER CREATES A TOTAL SAMPLE IN WHICH THE NATIONAL SAMPLES ARE PROPORTIONATE TO THEIR RESPECTIVE NATION'S SHARE OF THE POPULATION OF THE EUROPEAN COMMUNITY"

(Please forgive the all-caps; I thought it best to share exactly how it's described in the codebook.)

Thanks for any help you might be able to provide, and for the work you've done on the statsmodels package.

T.

josef...@gmail.com

Jun 14, 2017, 4:53:19 PM

to pystatsmodels

I think you can just specify weights=sampling_weights. The only result
that is affected by the normalization (weights.sum() is arbitrary) is
the units of the scale. Parameter estimates and the covariance of the
parameter estimates are (should be) independent of the interpretation
of the weights.

AFAICS, WLS does not allow weighted prediction. The current weights in
WLS are interpreted as variance weights reflecting heteroscedasticity.
(There might be a workaround for WLS, but I have not looked at this
yet.)

However, we are not supporting survey weights yet. While I think the
above is correct and should work with WLS, I would recommend checking
against Stata or similar.

The question is coming maybe half a year too early. We just started a
Google summer of code project on survey methods. Additionally, GLM is
the first model that is currently getting several types of weights,
where we figure out how to support this more generally. Once we have
more unit tests against Stata's weights, and specifically pweights for
this, we will know what works and what doesn't.

Josef

Tiarnán de Burca

Jun 15, 2017, 5:32:31 AM

to pystatsmodels

Josef,

Thanks a million for getting back to me.

> However, we are not supporting survey weights yet. While I think the
> above is correct and should work with WLS, I would recommend checking
> against Stata or similar.

It sounds like statsmodels may not be the tool for this work. Are there any other Pythonic tools that allow for this type of work?

I think the thing I'm trying to do is "OLS with Survey Weights". Does that sound right?

> The question is coming maybe half a year too early. We just started a
> Google summer of code project on survey methods. Additionally, GLM is
> the first model that is currently getting several types of weights,
> where we figure out how to support this more generally. Once we have
> more unit tests against Stata's weights, and specifically pweights for
> this, we will know what works and what doesn't.

Thanks a million again for all your work on this package.

T.

josef...@gmail.com

Jun 15, 2017, 10:46:00 AM

to pystatsmodels

On Thu, Jun 15, 2017 at 5:32 AM, Tiarnán de Burca <tdeb...@gmail.com> wrote:
> Josef,
>
> Thanks a million for getting back to me.
>
>> However, we are not supporting survey weights yet. While I think the
>> above is correct and should work with WLS, I would recommend checking
>> against Stata or similar.
>
>
> It sounds a like statsmodel may not be the tool to do this work in, are
> there any other pythonic tools that allow for this type of work?

I never came across any other Python package that supports survey weights.
If you need more survey support, then you currently need to switch to
R, Stata, or another package that has it.

(AFAIR, Stata's pweights are just regular weights plus robust
covariance, but the svy prefix has a lot more that is specific to
complex survey sampling and there I don't have a good enough overview
yet to make reliable statements.)
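
A rough NumPy sketch of what "regular weights plus robust covariance" could look like, for anyone following along. The function name and the simulated data are made up for illustration, and any finite-sample correction that a given package applies is left out, so treat it as a picture of the idea rather than a reproduction of Stata's pweights.

```python
import numpy as np

def weighted_fit_with_sandwich(X, y, w):
    """Weighted least squares estimates with an HC0-type sandwich covariance."""
    Xw = X * w[:, None]                 # each row of X scaled by its weight
    bread = np.linalg.inv(Xw.T @ X)     # (X' W X)^{-1}
    beta = bread @ (Xw.T @ y)           # solves X' W X beta = X' W y
    resid = y - X @ beta
    scores = Xw * resid[:, None]        # w_i * e_i * x_i, one row per observation
    meat = scores.T @ scores            # sum_i w_i^2 e_i^2 x_i x_i'
    return beta, bread @ meat @ bread

# Simulated data: a constant, one dummy regressor, and arbitrary weights.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.integers(0, 2, size=n)])
w = rng.uniform(0.2, 4.8, size=n)
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

beta, cov = weighted_fit_with_sandwich(X, y, w)
print(beta)                     # weighted point estimates
print(np.sqrt(np.diag(cov)))    # robust standard errors
```

Because the weights enter the bread once on each side and the meat squared, rescaling all weights by a constant cancels out of both the estimates and the covariance.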

>
> I think the thing I'm trying to do is "OLS with Survey Weights", does that
> sound right?

Google finds some things for this search term, but none of the search
results (on the first page) were ever on my reading list.

Josef

Tiarnán de Burca

Jun 15, 2017, 1:34:28 PM

to pystatsmodels

> I never came across any other Python package that supports survey weights.
> If you need more survey support, then you currently need to switch to
> R, Stata, or another package that has it.

In case someone searches the archives and needs an answer to the question, the solution I came to is to use scikit-learn.

This appears to do what I need:

from sklearn import linear_model

regr = linear_model.LinearRegression()

regr.fit(X, y, sample_weight=weights)

# Pair each column name with its fitted coefficient.
print(dict(zip(X.columns, regr.coef_)))

Without the 'sample_weight=weights' argument this calculates the same coefficients as statsmodels OLS, so I assume it is doing the right thing.

The documentation is here

Thanks again.

T.

josef...@gmail.com

Jun 15, 2017, 1:45:00 PM

to pystatsmodels

I guess that this is doing the same as statsmodels WLS (but without
the cov_type option) and is not specific to survey weights either.

Josef

Renzo Massari

Sep 13, 2017, 12:14:58 AM

to pystatsmodels

Tiarnan,

I quickly checked the survey design for your data (https://www.gesis.org/eurobarometer-data-service/survey-series/standard-special-eb/sampling-and-fieldwork/), and I see it is stratified and clustered (as most complex surveys are these days).

So please note that, as josefpktd said, it is not enough to use the weights as probability weights and then add robust error estimation: you need a sandwich estimator that takes the strata and the clusters into account!

Ignoring the clusters, in particular, will tend to seriously underestimate your standard errors.

I have reached out to josefpktd to see if I can help work on that. In the meantime, as he suggested, please use Stata or R's survey package, and please do model the survey design beyond the weights: not doing so is a common but pretty serious error.
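
To illustrate why the clusters matter, here is a small NumPy sketch of a weighted fit with a cluster-robust sandwich covariance. The function name, the simulated clusters, and the weights are all invented for the example. It handles clusters only: there are no strata and no finite-sample correction, so it is not a substitute for a proper survey estimator.

```python
import numpy as np

def weighted_cluster_cov(X, y, w, clusters):
    """Weighted estimates with a cluster-robust sandwich covariance.

    Scores are summed within each cluster before forming the "meat", so
    correlated errors inside a cluster are not treated as independent.
    """
    Xw = X * w[:, None]
    bread = np.linalg.inv(Xw.T @ X)            # (X' W X)^{-1}
    beta = bread @ (Xw.T @ y)
    scores = Xw * (y - X @ beta)[:, None]      # w_i * e_i * x_i
    _, idx = np.unique(clusters, return_inverse=True)
    cluster_scores = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(cluster_scores, idx, scores)     # sum scores within each cluster
    meat = cluster_scores.T @ cluster_scores
    return beta, bread @ meat @ bread

# Simulated data: 50 clusters of 20 rows, a cluster-level dummy regressor,
# and a shared random effect that correlates errors within each cluster.
rng = np.random.default_rng(3)
n_clusters, m = 50, 20
clusters = np.repeat(np.arange(n_clusters), m)
treat = np.repeat(rng.integers(0, 2, size=n_clusters), m).astype(float)
X = np.column_stack([np.ones(n_clusters * m), treat])
w = rng.uniform(0.2, 4.8, size=n_clusters * m)
cluster_effect = np.repeat(rng.normal(size=n_clusters), m)
y = X @ np.array([1.0, 0.5]) + cluster_effect + rng.normal(size=n_clusters * m)

beta, cov_cluster = weighted_cluster_cov(X, y, w, clusters)
# Treating every row as its own cluster reduces to the HC0-type estimate.
_, cov_indep = weighted_cluster_cov(X, y, w, np.arange(len(y)))

print(np.sqrt(np.diag(cov_cluster)))  # standard errors respecting clusters
print(np.sqrt(np.diag(cov_indep)))    # standard errors ignoring clusters
```

With errors this strongly correlated within clusters, the standard errors that ignore the clustering would be expected to come out noticeably smaller than the clustered ones.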