GSOC14: Dynamic linear panel data regression


Galen

Mar 18, 2014, 7:27:54 PM
to pystat...@googlegroups.com

Following this thread, I'm posting a revised plain-text proposal for GSOC 2014 here. I'd be grateful for any thoughts/comments.

Objective

To provide regression methods for estimation and inference on dynamic linear panel data models in package statsmodels within the Python language.

Abstract

The aim of this project is to provide tools for estimation and inference of linear panel data regression models in Statsmodels, a BSD-licensed Python package for statistical modelling. Panel data are ubiquitous in natural and social sciences, and they have their own specific statistical regression methods that should be supported in Python. Statsmodels is currently in the advanced stages of developing tools to calculate so-called fixed and random effects estimators. However, researchers in practice would often like to include a lagged dependent variable in the estimating equation, for which fixed and random effects estimators will not in general be consistent in the presence of unobserved cross-sectional heterogeneity. This project will add support to Statsmodels for the consistent estimation of these ‘dynamic’ panel data regression models for large N and small T.

Proposal

Motivation

Panel data are data that are indexed by cross-section i = 1, ..., N and time t = 1, ..., T, and arise when we have repeated measurements on the same experimental units. These data are particularly useful because they allow the researcher to estimate relationships between y and x even in the presence of unobserved cross-sectional, but time-invariant, heterogeneity. It is always possible to estimate linear regressions by ignoring the bivariate index (i, t) and simply applying Ordinary Least Squares, and generally this “pooled OLS” estimator will be consistent when the error term at each (i, t), including any unobserved component, is uncorrelated with the explanatory variables at each (i, t). However, in most examples of practical interest the researcher will be concerned about correlation between the explanatory variables and the error term, which can give rise to serious inconsistencies in pooled OLS.

For example, in the context of comparative politics, we may like to estimate the effect of total election expenditure on voter turnout, but neglecting the correlation between total election expenditure and unobserved components in the regression error term leads us to overestimate the effect ([10]). In the context of macroeconomics, we may like to use measurements across countries and years to measure the effect of investment on economic growth, but we may be concerned that investment is correlated with unobserved country-specific factors like cultural and institutional characteristics, so that pooled OLS will inconsistently estimate the effect of investment on growth. Panel data also arise in clinical drug trials, programme evaluation, financial portfolios, sports management and weather. A good statistical reference is [11].
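The inconsistency of pooled OLS described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: a balanced panel, a single regressor, NumPy available, and function names and parameter values that are purely illustrative (they are not part of statsmodels or of this proposal). The regressor x is built to be correlated with a time-invariant unobserved effect a_i, and pooled OLS is compared with the within (fixed effects) estimator.

```python
import numpy as np


def simulate_panel(n_units=500, n_periods=5, beta=1.0, seed=0):
    """Simulate y_it = beta * x_it + a_i + e_it with x_it correlated with a_i."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_units, 1))              # unobserved, time-invariant effect
    x = a + rng.normal(size=(n_units, n_periods))  # regressor depends on a_i
    e = rng.normal(size=(n_units, n_periods))      # idiosyncratic error
    y = beta * x + a + e
    return y, x


def pooled_ols_slope(y, x):
    """Slope from OLS of y on x and a constant, ignoring the (i, t) index."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())


def within_slope(y, x):
    """Slope after subtracting each unit's time average (within transform)."""
    xw = x - x.mean(axis=1, keepdims=True)
    yw = y - y.mean(axis=1, keepdims=True)
    return float((xw * yw).sum() / (xw ** 2).sum())


y, x = simulate_panel()
b_pooled = pooled_ols_slope(y, x)
b_within = within_slope(y, x)
print("pooled OLS:", round(b_pooled, 3))
print("within    :", round(b_within, 3))
```

In this design the true coefficient is 1.0, while the probability limit of pooled OLS is 1 + cov(x, a)/var(x) = 1.5, so the pooled estimate should sit well above the within estimate.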



Existing functionality

Python supports some panel data regression through pandas and statsmodels. Through pandas, users have access to a Panel class for panel datasets, with methods for applying functions, reshaping, managing missing entries, merging, subsetting and more ([4]). The pandas.ols function allows estimation of coefficients and standard errors of linear panel regression with fixed effects (intercepts) for the cross-section (i) and time (t) dimensions.

Package statsmodels is a BSD-licensed Python package for statistical modelling, currently in stable version 0.5.0 ([7]). It supports some panel data regression in the sandbox, and has tools for fixed effects (between-/within-groups) and random effects estimation (GLS), with one- and two-way effects, in advanced stages of development ([9]). These estimation methods are relevant when we suspect the data generating process to contain unobserved cross-sectional, time-invariant heterogeneity, and they are examples of ‘static’ panel data models, where no lagged dependent variables appear as explanatory variables. The development code in statsmodels allows hypothesis testing of linear restrictions on coefficients in the estimated model (Wald test), and for the presence of an unobserved component in the error term (Hausman test).

In the context of static linear panel regression, Python and statsmodels are currently lacking hypothesis tests for heteroskedasticity and serial correlation, and for omitted non-linearities in the regression specification. Estimating these static linear panel models by First Differences is not yet supported, while estimation by Instrumental Variables and Generalised Method of Moments is limited ([8]).

Specifics

I propose to make the following contributions to the existing work on static linear panel regression, with comprehensive unit testing and documentation, to the development code in statsmodels:

·    Hypothesis test for unobserved cross-sectional, time-invariant heterogeneity (applies to fixed effects estimation).

·    Hypothesis tests (Lagrange Multiplier-type) for heteroskedasticity and serial correlation (applies to both fixed and random effects estimation).

·    Estimation of panel data unobserved effect regression models by First Differencing.
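As a rough illustration of the last item in this list, first-difference estimation could look like the sketch below. It assumes a balanced panel stored as (N, T) NumPy arrays and a single regressor; the function name and the simulated data are hypothetical and do not reflect any existing statsmodels API. Differencing removes the time-invariant effect but makes the differenced errors serially correlated, so the sketch reports a standard error clustered by unit.

```python
import numpy as np


def first_difference_fit(y, x):
    """First-difference slope and cluster-robust standard error (scalar x)."""
    dy = np.diff(y, axis=1)            # shape (N, T-1); the unit effect a_i cancels
    dx = np.diff(x, axis=1)
    sxx = (dx ** 2).sum()
    beta = (dx * dy).sum() / sxx       # OLS on the differenced equation, no constant
    resid = dy - beta * dx
    # Sandwich variance with one score per cross-sectional unit (cluster by i).
    score = (dx * resid).sum(axis=1)
    se = np.sqrt((score ** 2).sum()) / sxx
    return float(beta), float(se)


# Illustrative data: x is correlated with the unobserved effect a_i.
rng = np.random.default_rng(1)
n_units, n_periods, true_beta = 400, 4, 2.0
a = rng.normal(size=(n_units, 1))
x = 0.8 * a + rng.normal(size=(n_units, n_periods))
y = true_beta * x + a + rng.normal(size=(n_units, n_periods))

beta_fd, se_fd = first_difference_fit(y, x)
print(f"FD estimate {beta_fd:.3f} (cluster-robust s.e. {se_fd:.3f})")
```

With only two time periods the first-difference and within estimators coincide, which is a convenient check for unit tests.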

I start my proposal with contributions to static models because I think these should be a priority for statsmodels. However, my expertise lies in dynamic linear panel models. I would like to extend the functionality of statsmodels by allowing for the estimation of linear panel data regression models with sequential moment restrictions. (Fixed and Random Effects estimators are inconsistent in models with unobserved components in the error terms where lagged dependent variables appear as explanatory variables.) This means adding support to statsmodels for:

·    Estimation of panel linear regressions by linear Generalised Method of Moments (GMM); in particular

·    The ‘difference’ GMM estimator of Arellano–Bond; and

·    The ‘system’ GMM estimator of Blundell–Bond.
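To make the difference GMM idea concrete, here is a minimal sketch for the simplest case: a pure AR(1) panel y_it = rho * y_i,t-1 + a_i + e_it, balanced, with no other regressors. It uses all available lagged levels as instruments for the differenced equation and the usual one-step weighting matrix. All function names are hypothetical, and a real implementation would need extra regressors, unbalanced panels, two-step weights and standard errors, none of which are shown.

```python
import numpy as np


def simulate_ar1_panel(n_units=2000, n_periods=6, rho=0.5, burn_in=50, seed=0):
    """Simulate y_it = rho * y_i,t-1 + a_i + e_it, keeping the last n_periods."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n_units)
    y = a / (1.0 - rho) + rng.normal(size=n_units)   # start near the unit mean
    out = np.empty((n_units, n_periods))
    for t in range(burn_in + n_periods):
        y = rho * y + a + rng.normal(size=n_units)
        if t >= burn_in:
            out[:, t - burn_in] = y
    return out


def within_ar1(y):
    """Fixed effects (within) estimate of rho; inconsistent for small T."""
    ylag, ycur = y[:, :-1], y[:, 1:]
    xl = ylag - ylag.mean(axis=1, keepdims=True)
    yc = ycur - ycur.mean(axis=1, keepdims=True)
    return float((xl * yc).sum() / (xl ** 2).sum())


def instrument_matrix(y_i):
    """Block-diagonal instruments for one unit.

    The row for the differenced equation at period t (0-based, t = 2..T-1)
    holds the levels y_i[0], ..., y_i[t-2] in its own block of columns.
    """
    n_periods = y_i.shape[0]
    n_cols = (n_periods - 1) * (n_periods - 2) // 2
    z = np.zeros((n_periods - 2, n_cols))
    col = 0
    for row, t in enumerate(range(2, n_periods)):
        width = t - 1
        z[row, col:col + width] = y_i[:width]
        col += width
    return z


def difference_gmm_ar1(y):
    """One-step difference GMM estimate of rho for the AR(1) panel model."""
    n_units, n_periods = y.shape
    dy = np.diff(y, axis=1)
    dep, lag = dy[:, 1:], dy[:, :-1]      # Delta y_t and Delta y_{t-1}
    m = n_periods - 2
    # Covariance pattern of first-differenced i.i.d. errors (up to scale).
    h = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    n_cols = (n_periods - 1) * (n_periods - 2) // 2
    zhz = np.zeros((n_cols, n_cols))
    zx = np.zeros(n_cols)
    zy = np.zeros(n_cols)
    for i in range(n_units):
        z = instrument_matrix(y[i])
        zhz += z.T @ h @ z
        zx += z.T @ lag[i]
        zy += z.T @ dep[i]
    w = np.linalg.pinv(zhz)               # one-step weighting matrix
    return float((zx @ w @ zy) / (zx @ w @ zx))


panel = simulate_ar1_panel()
rho_within = within_ar1(panel)
rho_gmm = difference_gmm_ar1(panel)
print("within estimate of rho        :", round(rho_within, 3))
print("difference GMM estimate of rho:", round(rho_gmm, 3))
```

With the true rho set to 0.5 and only six time periods, the within estimate should be biased well below 0.5, while the GMM estimate should be close to it. The explicit loop over units keeps the instrument construction readable; a production version would presumably vectorise it or use sparse blocks.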

I propose to benchmark speed and accuracy of the static panel regression code by comparing to xtreg in Stata and lm in R. Although R offers package plm for estimating panel linear models, the best software for estimating dynamic panel regressions, in terms of speed, robustness and user-friendliness, is Stata’s xtabond2 command (see [5]). Hence xtabond2 would be a preferred benchmarking tool for dynamic linear panel code.

Timeline

Throughout this timeline, I will communicate regularly with mentors and the statsmodels mailing list, and I will publish bi-weekly blog updates of progress.


Before Week 1 It is expected that merging of the fixed and random effects estimators in PR1133 ([6]) will take place before Week 1 of GSOC 2014. I would contribute to this project to ensure that my contributions during GSOC can easily be merged into statsmodels by the end of the summer. At the same time, I would familiarise myself with the architecture of statsmodels.

Weeks 1-2 Add support for computation of test statistics and p-values associated with (i) the existence of unobserved cross-sectional, time-invariant heterogeneity and (ii) heteroskedasticity and serial correlation in the model error terms.

Weeks 3-4 Add support for a first-difference transform of the model equation. Calculate the first-difference least squares estimator and covariance matrix by applying (robust/non-robust) least squares to the first-differenced equation.

Weeks 5-7 Add support for estimation of dynamic linear panel regression models by linear Generalised Method of Moments, using the Arellano–Bond ‘difference’ estimator ([1]). This would involve careful construction of the instrument matrix, as well as 1-step and 2-step estimators and options to restrict the instrument set.

Week 8 Add support for Sargan/Hansen tests of overidentifying restrictions, including test statistics and p-values to be computed at estimation time.

Weeks 9-11 Extend the above linear Generalised Method of Moments estimation to include the ‘system’ estimator of Blundell–Bond ([3]).

Weeks 12-13 Ensure code is thoroughly tested and documented. Tie up loose ends.

References

[1]  Arellano, M. and S. Bond. 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. The review of economic studies. 58(2): 277-297.

[2]  Arellano, M. and O. Bover. 1995. Another look at the instrumental variable estimation of error-components models. Journal of econometrics. 68(1): 29-51.

[3]  Blundell, R. and S. Bond. 1998. Initial conditions and moment restrictions in dynamic panel data models. Journal of econometrics. 87(1): 115-143.

[4]  Pandas home page: http://pandas.pydata.org

[5]  Roodman, D. 2009. How to do xtabond2: An introduction to difference and system GMM in Stata. The Stata Journal. 9(1): 86-136.

[6]  Statsmodels Work In Progress discussion: https://github.com/statsmodels/statsmodels/pull/1133

[7]  Statsmodels home page: http://statsmodels.sourceforge.net

[8]  Statsmodels: Generalised Method of Moments http://statsmodels.sourceforge.net/devel/gmm.html

[9]  Basic panel data models for statsmodels: https://gist.github.com/vincentarelbundock/5053686

[10]      Wilson, S. and D. Butler. 2004. A lot more work to do: the promise and peril of panel data in Political Science. Stanford

[11]      Wooldridge, J. 2001. Econometric Analysis of Cross-Section and Panel Data. The MIT Press: Cambridge MA.

josef...@gmail.com

Mar 18, 2014, 8:27:50 PM
to pystatsmodels
Hi Galen,

Thanks for the updated proposal.

Based on a quick reading I think this is very good now.

I think the timeline looks fine, with the usual qualifier that some things might turn out to be easier and some more time-consuming.
In the last part it might be important to finish the simpler versions rather than half-finish several general versions.

I read the description for the Ox equivalent of xtabond a while ago, but I skipped the panel system estimator section in Stata's gmm manual in my latest rewrite of GMM.
As I mentioned before, I think the indexing problems in the system estimators are a bit "scary", and I would be glad to have a dynamic panel expert in statsmodels to cover this area.

About the first part: Currently we don't have any diagnostic tests related to panel models (we don't have the panel models themselves either).  
My guess is that there it is the opposite of xtabond in terms of implementation: it might be just a few lines of code each, but you have to know which. I have only a vague idea about the diagnostic tests for panel data based on reading the abstract of several articles. 
How time consuming this will be depends pretty much on the availability of clear references.

I have to read the proposal again more carefully and think about it, then I will have more questions.

Josef

josef...@gmail.com

Mar 18, 2014, 9:46:38 PM
to pystatsmodels
Hi Galen,

Can you briefly introduce yourself? 
From the university page I can see that Bond is your faculty advisor. 
What's your background in Python? 

About the proposal, I have only one clarifying question, on the first line of the motivation; it need not change the proposal.
Is the plan only for balanced panel or also for unbalanced panel?

For the difference estimator, do you plan on just the lag difference transform or also the forward transform or similar?


some general comments:

I'm reading the proposal as an econometrician, assuming we mean the same thing by the names of the different estimators and by what implementing models like those in xtabond2 involves.

What I would briefly mention in the proposal is that statsmodels is also getting similar features from the statistics side for the analysis of panel/longitudinal data. 
Generalized Estimating Equations (GEE) are already in master, and mixed effects models are in a pull request that will also be merged soon. Both allow for different, more flexible kinds of correlation structures than the models from the econometrics side in the proposal. However, they assume that the explanatory variables are strictly exogenous (*): the unobserved error term is uncorrelated with, or independent of, all of the explanatory variables. For example, no lagged dependent variables are allowed, and unobserved heterogeneity cannot be correlated with any explanatory variables. (Assuming I remember the assumptions correctly.)

GEE is similar to GMM with modeled correlation within a panel unit but no correlation across panel units
Mixed effects models are similar to random coefficient models in econometrics

GEE also covers other models than linear models, Logit, Poisson, ...

I think the two sides complement each other very well in covering the panel/longitudinal use cases, although there is some language barrier.

-----
(*) oh no, not that word 
(sorry, insider joke)

Josef

Galen

Mar 19, 2014, 12:19:51 PM
to pystat...@googlegroups.com
Hi Josef

Thanks for the comments. I can benchmark against Ox xtabond too, which is supposedly fast. I have the code "DPD for Gauss", which would also be a natural benchmark. I must confess ignorance of GEE, but I can read enough to reference it in the proposal. 'Endogeneity' is the focus of Econometrics and I think the major feature differentiating it from Statistics, presumably because the Social Sciences are plagued with non-experimental data where the explanatory variables and error terms are jointly determined. 

Many things could be proposed, but I think you're better placed than me to decide where a limited amount of time should be spent. For example, if you think I should ignore the static models, then more can be done/proposed on the dynamic models. Unless they're made an explicit goal, whether unbalanced panels will be covered is just a matter of how quickly things progress. I'd start with the lag difference estimator, but design code with other transformations in mind (e.g. forward orthogonal deviations).
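For readers unfamiliar with the forward orthogonal deviations transform mentioned above, here is a minimal sketch for a balanced panel held as an (N, T) NumPy array. Each observation has the mean of that unit's remaining future observations subtracted and is then rescaled so that i.i.d. errors stay homoskedastic and uncorrelated. The function name is hypothetical and nothing here is taken from the proposal's code.

```python
import numpy as np


def forward_orthogonal_deviations(x):
    """Forward orthogonal deviations of a balanced (N, T) panel.

    Returns an (N, T-1) array: for each period except the last, the value
    minus the mean of all later values for the same unit, scaled by
    sqrt(k / (k + 1)) where k is the number of later observations.
    """
    n_periods = x.shape[1]
    # tail_sum[:, t] is the sum of x[:, t], x[:, t+1], ..., x[:, T-1].
    tail_sum = np.cumsum(x[:, ::-1], axis=1)[:, ::-1]
    remaining = np.arange(n_periods - 1, 0, -1)      # k = T-1, ..., 1
    future_mean = tail_sum[:, 1:] / remaining
    scale = np.sqrt(remaining / (remaining + 1.0))
    return scale * (x[:, :-1] - future_mean)


example = np.array([[1.0, 2.0, 3.0]])
fod = forward_orthogonal_deviations(example)
print(fod)
```

Like first differencing, the transform removes any time-invariant unit effect, since adding a constant to every period of a unit leaves the output unchanged. Unlike differencing, it does not introduce serial correlation into i.i.d. errors.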

I'm a PhD candidate in Economics at Oxford under the supervision of Steve Bond. I work on applied panel regression and a bit of econometric theory. I was Kevin's teaching assistant in 2012/2013, so that's how I know him. I've used R extensively for about 7 years, and I've worked in Eviews, Stata, Ox, VBA, VB.NET etc. I started Python about 2 years ago and I'm now comfortable in it, having used selenium for scraping websites, nltk for classification & sentiment analysis, random for simulating data compression algorithms with Monte Carlo (following this research: http://arxiv.org/abs/1304.0353), and taking edX 6.00.1x and 6.00.2x.

Best wishes
Galen

josef...@gmail.com

Mar 19, 2014, 1:48:07 PM
to pystatsmodels
On Wed, Mar 19, 2014 at 12:19 PM, Galen <gale...@gmail.com> wrote:
Hi Josef

Thanks for the comments. I can benchmark against Ox xtabond too, which is supposedly fast. I have the code "DPD for Gauss", which would also be a natural benchmark. I must confess ignorance of GEE, but I can read enough to reference it in the proposal. 'Endogeneity' is the focus of Econometrics and I think the major feature differentiating it from Statistics, presumably because the Social Sciences are plagued with non-experimental data where the explanatory variables and error terms are jointly determined. 

Wooldridge Econometric Analysis of Cross Section and Panel Data has some explanations of GEE from an econometrics viewpoint, in chapter 13 and some other places.

I never used OX, but I was reading several papers by Doornik because he usually explains more implementation details than many other articles.

 
Many things could be proposed, but I think you're better placed than me to decide where a limited amount of time should be spent. For example, if you think I should ignore the static models, then more can be done/proposed on the dynamic models. Unless they're made an explicit goal, whether unbalanced panels will be covered is just a matter of how quickly things progress. I'd start with the lag difference estimator, but design code with other transformations in mind (e.g. forward orthogonal deviations).

I would prefer that most of the time be spent on the dynamic parts, but I still think it's useful if you work on the static part first. It has the advantage for statsmodels of getting another set of eyes on the code and some additional functionality, and you would have some additional time to familiarize yourself with the approach existing at that time before plunging into the new models.
 

I'm a PhD candidate in Economics at Oxford under the supervision of Steve Bond. I work on applied panel regression and a bit of econometric theory. I was Kevin's teaching assistant in 2012/2013, so that's how I know him. I've used R extensively for about 7 years, and I've worked in Eviews, Stata, Ox, VBA, VB.NET etc. I started Python about 2 years ago and I'm now comfortable in it, having used selenium for scraping websites, nltk for classification & sentiment analysis, random for simulating data compression algorithms with Monte Carlo (following this research: http://arxiv.org/abs/1304.0353), and taking edX 6.00.1x and 6.00.2x.

sounds very good, and an interesting read (fast skimming)

You still need to submit your application and proposal to melange.
Do you already have a github account?
Do you already have plans for the required pull request or code sample?

Thanks,

Josef

Galen

Mar 20, 2014, 8:48:07 AM
to pystat...@googlegroups.com
Hi Josef,

Proposal is submitted on Melange with sections on GEE, GitHub account and link to a pull request. I'm happy to iterate further.

Best wishes
Galen

josef...@gmail.com

Mar 20, 2014, 9:27:01 AM
to pystatsmodels
On Thu, Mar 20, 2014 at 8:48 AM, Galen <gale...@gmail.com> wrote:
Hi Josef,

Proposal is submitted on Melange with sections on GEE, GitHub account and link to a pull request. I'm happy to iterate further.

The proposal looks good, I don't see any urgent need to iterate right now. I need to look at other things today
The random (?) changes in fonts in the proposal in Melange are a bit distracting.

I didn't know your github name.  PR #1491 should be ready to merge soon, but I need to read through it again. 

Galen

Mar 20, 2014, 10:53:05 AM
to pystat...@googlegroups.com
With PR #1491, I'm happy with acovf if you are, but I still need to extend the same principles to acf and pacf...

Agree the font changes are brutal! 

Hao Summer

Dec 6, 2016, 9:15:52 AM
to pystatsmodels
Hello,
I am wondering what has happened to this GSOC. 
I am trying to run a dynamic panel regression using firm level investment expenditure data. 
Arellano and Bond seems to be a natural choice, and I haven't found any good Python package for it yet.
Thanks!

josef...@gmail.com

Dec 6, 2016, 9:28:41 AM
to pystatsmodels
On Tue, Dec 6, 2016 at 9:04 AM, Hao Summer <hao.s...@gmail.com> wrote:
Hello,
I am wondering what has happened to this GSOC. 
I am trying to run a dynamic panel regression using firm level investment expenditure data. 
Arellano and Bond seems to be a natural choice, and I haven't found any good Python package for it yet.
Thanks!

I have not seen any Python package for it. If you know of even a draft version in Python, that would be helpful.

Unfortunately for us, we lost the student to an internship that we couldn't compete with. So nothing has happened in this area.

(I'm thinking about it every once in a while, but I'm not sure yet how to handle unbalanced, stacked panel data in GMM.)

Josef

Damien Moore

Jun 23, 2017, 5:53:26 PM
to pystatsmodels
I've also been looking for something to handle dynamic panel regression. This paper looks interesting in terms of handling fixed effects models in unbalanced panels with lagged dependent variables:


Is something like the prerequisite, what they call linear structural equation models (SEM), available in statsmodels? (see p5)

Damien Moore

Jun 23, 2017, 6:04:18 PM
to pystatsmodels
Oops, I should have linked to this paper (simpler version without the cross-lags):

josef...@gmail.com

Jun 24, 2017, 5:49:10 AM
to pystatsmodels
On Fri, Jun 23, 2017 at 6:04 PM, Damien Moore <damien...@gmail.com> wrote:
Oops, I should have linked to this paper (simpler version without the cross-lags):




On Friday, June 23, 2017 at 5:53:26 PM UTC-4, Damien Moore wrote:
I've also been looking for something to handle dynamic panel regression. This paper looks interesting in terms of handling fixed effects models in unbalanced panels with lagged dependent variables:


Is something like the prerequisite, what they call linear structural equation models (SEM), available in statsmodels? (see p5)


Thanks for the reference.

We don't have anything like a general SEM implementation yet, and I don't know of any other implementation in Python.
We have an old PR for econometrics systems of equations like 3 stage least squares, but AFAIR nothing with full MLE.

There are no immediate plans for implementing SEM in statsmodels, at least I don't know of anybody working on it. It is a good topic and of an appropriate size for a future GSOC project. But the selection of GSOC topics is largely driven by the interest and availability of students.

It looks like we don't have a github issue for dynamic panel data.

Josef