Request: Model selection of Generalized Estimating Equations (GEE)


Robin Kramer

Nov 22, 2016, 10:20:20 AM
to pystatsmodels
After having learned R, I've been using Python as my main tool for data analysis, and I've been loving it so far. However, since I've started using statistical methods that are just a little more complex, I've run into its limits: many of the more complex methods that exist in R have not been implemented in statsmodels yet :( 

I am missing one method in particular. After fitting several GEEs, I have a number of models for which I can tell which independent variables contribute significantly to the prediction of the dependent variable. However, what I do not know, and cannot find, is which model is better than the other. For linear mixed-effects models there are AIC, BIC and log-likelihood methods, but these cannot be used for GEE (https://onlinecourses.science.psu.edu/stat504/node/180). There is a method called the Quasi-likelihood under the Independence model Criterion (QIC; http://www.jstatsoft.org/v57/c01/paper), which is implemented in R (http://stats.stackexchange.com/questions/21771/how-to-perform-model-selection-in-gee-in-r), but again, not in Python.

Is it possible to have the QIC method implemented in Python too? The linked paper describes how to calculate the QIC rather briefly, so I hope it won't be too difficult. Hope to hear from you!
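For reference, as far as I can tell the criterion in the paper is

QIC(R) = -2 * Q(beta_hat(R); I) + 2 * trace(Omega_I * V_R),

where Q(.; I) is the quasi-likelihood evaluated under the independence working correlation, Omega_I is the negative second-derivative matrix of that quasi-likelihood at beta_hat(R), and V_R is the robust (sandwich) covariance estimate of beta_hat(R).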

josef...@gmail.com

Nov 22, 2016, 4:10:23 PM
to pystatsmodels
Sounds very popular. There is also a user-contributed Stata version, so we need it too.

We don't have a quasi-likelihood attached to our families, AFAIR, only the full likelihood. So, that might be a missing piece for a quick implementation from the outside of the models.

However, I'm not sure what the background for this is. Skimming the Pan (2001) article, it sounds a bit "shaky" to me. Quasi-likelihood mainly adds dispersion, and if there are any other deviations from the "true" likelihood model, as assumed in GEE, then comparing an objective function across models doesn't have any standard/nice properties.
But I have never looked at AIC/BIC alternatives for the case where we don't have at least a likelihood interpretation.

There might be some selection criteria based on predictive accuracy. For parameter estimation and selection of variables, I would rely more on the score tests which we have built into GEE (score tests in GEE are a bit conservative, as far as I remember).
Or, my intuition is wrong.
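
To make concrete what I mean by the score test route, something along the following lines should work (method name and details are from memory, so double-check the docstrings; the simulated data frame is only there to make the snippet self-contained):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# made-up clustered data, only so the example runs end to end
rng = np.random.RandomState(1234)
n_groups, n_per = 40, 5
n = n_groups * n_per
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_groups), n_per),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = 1.0 + 0.5 * df["x1"] + rng.normal(size=n)

fam = sm.families.Gaussian()
cov = sm.cov_struct.Exchangeable()
full = sm.GEE.from_formula("y ~ x1 + x2 + x3", groups="id", data=df,
                           family=fam, cov_struct=cov).fit()
sub = sm.GEE.from_formula("y ~ x1 + x2", groups="id", data=df,
                          family=fam, cov_struct=cov).fit()

# score test for dropping x3, computed from the full and restricted fits
print(full.compare_score_test(sub))

The returned statistic and p-value can then be used like any other test of whether the extra term is worth keeping.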

Josef

josef...@gmail.com

Nov 22, 2016, 8:42:16 PM
to pystatsmodels
On Tue, Nov 22, 2016 at 10:04 AM, Robin Kramer <kramer...@gmail.com> wrote:
[...] many of the more complex methods that exist in R have not been implemented in statsmodels yet :( 

to add a few comments to this:

As in the "Python versus R" discussion, which has largely shifted to "Python and R", it should be clear that there are no (!?) other packages that cover as wide a range of statistical methods as R. (Usability and consistency across packages is a different issue.)

For example, for (outlier) robust estimation we have just the basic M-estimators, and scikit-learn has a few things, while in R many of the "big" names and several dedicated R developers (package maintainers) have been collaborating for 10 to 20 years.

So we try to provide the basic tools plus pretty good coverage in the areas that some developers are more interested in. State space models are currently one of those. GLM and GEE are in pretty good shape, so it's worth thinking about which parts are still missing.
For GLM: we have gained some of those pieces since those issues were opened, largely because of the work of thequackdaddy.

Having good issues and wishlists for those areas is useful so we can see what's missing and what the priorities should be. Implementing something might still take time if nobody is interested enough to work on it.

Given my personal interests, I often get stuck on generic methods and reusable tools that are missing and would let us extend several models at once. Sandwich covariances are one such piece that has been pretty successful; adding weights to all models would be another great addition that would open up many new applications for the existing models. We also need better and more flexible covariance matrix estimators in general that can be plugged in in several places. Another item that has been on my wishlist for a long time is generic diagnostic measures and hypothesis tests that can be plugged into many models, instead of having them only for OLS as we do now.
So I'm always happy to see contributions for specific methods as a counterpoint to me getting lost in generic or general solutions.

Josef

josef...@gmail.com

Nov 22, 2016, 9:34:16 PM
to pystatsmodels
On Tue, Nov 22, 2016 at 4:10 PM, <josef...@gmail.com> wrote:

We don't have a quasi-likelihood attached to our families, AFAIR, only the full likelihood. So, that might be a missing piece for a quick implementation from the outside of the models.

GEE has the family attached, which can calculate a quasi-loglikelihood (including the normalizing constant) if the scale argument is provided. So this should be readily available as well. (I'm not sure whether a scale argument different from the default is used yet in any code or has unit tests.)

The independence case could also be calculated using GLM instead of GEE, if the sandwich covariance is not required.
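
In case somebody wants to experiment before anything lands in statsmodels, here is a rough sketch of how Pan's QIC could be computed from the outside with the pieces mentioned above. The helper name qic_approx, the toy data, and the use of the naive covariance from an independence fit as an approximation to Omega_I are my own choices for illustration, so treat this as a starting point and not a reference implementation:

import numpy as np
import pandas as pd
import statsmodels.api as sm

def qic_approx(result, result_indep):
    # QIC(R) = -2 * Q(beta_hat(R); I) + 2 * trace(Omega_I * V_R)
    # Q(.; I): quasi-loglikelihood under independence, taken here from
    #          family.loglike at the fitted means with the estimated scale
    # Omega_I: model-based information under independence, approximated by
    #          the inverse of the naive covariance of an independence fit
    # V_R:     robust (sandwich) covariance of beta_hat(R)
    model = result.model
    ql = model.family.loglike(model.endog, result.fittedvalues,
                              scale=result.scale)
    omega_i = np.linalg.inv(result_indep.cov_naive)
    penalty = 2.0 * np.trace(np.dot(omega_i, result.cov_robust))
    return -2.0 * ql + penalty

# toy clustered Poisson data, only so the sketch runs
rng = np.random.RandomState(0)
n_groups, n_per = 50, 4
n = n_groups * n_per
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_groups), n_per),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["y"] = rng.poisson(np.exp(0.5 * df["x1"]))

fam = sm.families.Poisson()
res_exc = sm.GEE.from_formula("y ~ x1 + x2", groups="id", data=df,
                              family=fam,
                              cov_struct=sm.cov_struct.Exchangeable()).fit()
res_ind = sm.GEE.from_formula("y ~ x1 + x2", groups="id", data=df,
                              family=fam,
                              cov_struct=sm.cov_struct.Independence()).fit()
print(qic_approx(res_exc, res_ind))

A smaller QIC would then point to the preferred model, in the same spirit as AIC.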

Josef