Following this thread, I'm posting a revised plain-text proposal for GSOC 2014 here. I'd be grateful for any thoughts/comments.
Objective
To provide regression methods for estimation and inference on dynamic linear panel data models in the statsmodels package for Python.
Abstract
The aim of this project is to provide tools for estimation and inference of linear panel data regression models in Statsmodels, a BSD-licensed Python package for statistical modelling. Panel data are ubiquitous in natural and social sciences, and they have their own specific statistical regression methods that should be supported in Python. Statsmodels is currently in the advanced stages of developing tools to calculate so-called fixed and random effects estimators. However, researchers in practice would often like to include a lagged dependent variable in the estimating equation, for which fixed and random effects estimators will not in general be consistent in the presence of unobserved cross-sectional heterogeneity. This project will add support to Statsmodels for the consistent estimation of these ‘dynamic’ panel data regression models for large N and small T.
Proposal
Motivation
Panel data are data that are indexed by cross-section i = 1, ..., N and time t = 1, ..., T, and arise when we have repeated measurements on the same experimental units. These data are particularly useful because they allow the researcher to estimate relationships between y and x even in the presence of unobserved cross-sectional, but time-invariant, heterogeneity. It is always possible to estimate linear regressions by ignoring the bivariate index (i, t) and simply applying Ordinary Least Squares, and generally this “pooled OLS” estimator will be consistent when the error term at each (i, t), including any unobserved component, is uncorrelated with the explanatory variables at each (i, t). However, in most examples of practical interest the researcher will be concerned about correlation between the explanatory variables and the error term, which can give rise to serious inconsistencies in pooled OLS.
For example, in the context of comparative politics, we may like to estimate the effect of total election expenditure on voter turnout, but neglecting the correlation between total election expenditure and unobserved components in the regression error term leads us to overestimate the effect ([10]). In the context of macroeconomics, we may like to use measurements across countries and years to measure the effect of investment on economic growth, but we may be concerned that investment is correlated with unobserved country-specific factors like cultural and institutional characteristics, so that pooled OLS will inconsistently estimate the effect of investment on growth. Panel data also arise in clinical drug trials, programme evaluation, financial portfolios, sports management and weather. A good statistical reference is [11].
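The inconsistency of pooled OLS described above can be illustrated with a small simulation. This is only a sketch under assumed conditions: a single regressor, a balanced panel, and a data generating process in which x is correlated with the unobserved effect by construction. The function names are hypothetical and are not part of statsmodels or pandas.

```python
import numpy as np

def simulate_static_panel(n_units=500, n_periods=5, beta=1.0, seed=0):
    """Simulate y_it = beta * x_it + alpha_i + eps_it with x correlated with alpha."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(n_units, 1))              # unobserved, time-invariant effect
    x = alpha + rng.normal(size=(n_units, n_periods))  # regressor depends on alpha
    eps = rng.normal(size=(n_units, n_periods))
    y = beta * x + alpha + eps
    return y, x

def pooled_ols_slope(y, x):
    """Slope from OLS on the pooled data, ignoring the (i, t) structure."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())

def within_slope(y, x):
    """Slope from OLS after subtracting unit-specific means (fixed effects)."""
    xd = x - x.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    return float((xd * yd).sum() / (xd ** 2).sum())

y, x = simulate_static_panel()
b_pooled = pooled_ols_slope(y, x)   # picks up the correlation between x and alpha
b_within = within_slope(y, x)       # alpha is removed by the within transform
print(f"pooled OLS: {b_pooled:.3f}, within: {b_within:.3f}, true beta: 1.0")
```

In this design the pooled slope should settle well above the true value of 1.0, while the within-groups slope should be close to it.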
Existing functionality
Python supports some panel data regression through pandas and statsmodels. Through pandas, users have access to a Panel class for panel datasets, with methods for applying functions, reshaping, managing missing entries, merging, subsetting and more ([4]). The pandas.ols function allows estimation of coefficients and standard errors of linear panel regression with fixed effects (intercepts) for the cross-section (i) and time (t) dimensions.
Package statsmodels is a BSD-licensed Python package for statistical modelling, currently in stable version 0.5.0 ([7]). It supports some panel data regression in the sandbox, and has tools for fixed effects (between-/within-groups) and random effects estimation (GLS), with one- and two-way effects, in advanced stages of development ([9]). These estimation methods are relevant when we suspect the data generating process to contain unobserved cross-sectional, time-invariant heterogeneity, and they are examples of ‘static’ panel data models, where no lagged dependent variables appear as explanatory variables. The development code in statsmodels allows hypothesis testing of linear restrictions on coefficients in the estimated model (Wald test), and for the presence of an unobserved component in the error term (Hausman test).
In the context of static linear panel regression, Python and statsmodels are currently lacking hypothesis tests for heteroskedasticity and serial correlation, and for omitted non-linearities in the regression specification. Estimating these static linear panel models by First Differences is not yet supported, while estimation by Instrumental Variables and Generalised Method of Moments is limited ([8]).
Specifics
I propose to make the following contributions, with comprehensive unit testing and documentation, to the existing development code for static linear panel regression in statsmodels:
· Hypothesis test for unobserved cross-sectional, time-invariant heterogeneity (applies to fixed effects estimation).
· Hypothesis tests (Lagrange Multiplier-type) for heteroskedasticity and serial correlation (applies to both fixed and random effects estimation).
· Estimation of panel data unobserved effect regression models by First Differencing.
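As a rough illustration of the first-differencing item in the list above, the sketch below differences a balanced panel over time and applies least squares to the differenced equation, returning both a non-robust and a unit-clustered standard error. It assumes a single regressor and no time effects; `first_difference_ols` is a hypothetical name, not an existing statsmodels function.

```python
import numpy as np

def first_difference_ols(y, x):
    """First-difference estimator for y_it = beta * x_it + alpha_i + eps_it.

    y, x: arrays of shape (n_units, n_periods) from a balanced panel.
    Returns (beta, non-robust s.e., cluster-robust s.e.).
    """
    dy = np.diff(y, axis=1)          # differencing removes alpha_i
    dx = np.diff(x, axis=1)
    sxx = (dx ** 2).sum()
    beta = (dx * dy).sum() / sxx
    resid = dy - beta * dx
    # Non-robust: treats differenced errors as homoskedastic and uncorrelated.
    sigma2 = (resid ** 2).sum() / (resid.size - 1)
    se_plain = np.sqrt(sigma2 / sxx)
    # Cluster-robust: allows arbitrary correlation within each unit over time.
    scores = (dx * resid).sum(axis=1)
    se_cluster = np.sqrt((scores ** 2).sum()) / sxx
    return float(beta), float(se_plain), float(se_cluster)

# Simulated data (an assumption for illustration): true beta = 2.0.
rng = np.random.default_rng(1)
n_units, n_periods = 500, 5
alpha = rng.normal(size=(n_units, 1))
x = alpha + rng.normal(size=(n_units, n_periods))
y = 2.0 * x + alpha + rng.normal(size=(n_units, n_periods))

beta_fd, se_plain, se_cluster = first_difference_ols(y, x)
print(f"FD estimate: {beta_fd:.3f} (plain s.e. {se_plain:.3f}, clustered s.e. {se_cluster:.3f})")
```

Differencing white-noise errors induces serial correlation in the differenced errors, which is why a clustered covariance is shown alongside the non-robust one.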
I start my proposal with contributions to static models because I think these should be a priority for statsmodels. However, my expertise lies in dynamic linear panel models. I would like to extend the functionality of statsmodels by allowing for the estimation of linear panel data regression models with sequential moment restrictions. (Fixed and Random Effects estimators are inconsistent in models with unobserved components in the error terms where lagged dependent variables appear as explanatory variables.) This means adding support to statsmodels for:
· Estimation of panel linear regressions by linear Generalised Method of Moments (GMM); in particular
· The ‘difference’ GMM estimator of Arellano–Bond; and
· The ‘system’ GMM estimator of Blundell–Bond.
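To make the contrast concrete, the following sketch simulates a first-order autoregressive panel with unobserved effects, applies the within-groups estimator (which is inconsistent here for fixed T), and then applies a one-step ‘difference’ GMM estimator using lagged levels as instruments for the differenced equation. It is a minimal illustration under strong assumptions: a balanced panel, a single lagged dependent variable, no other regressors, and the full instrument set. All function names are hypothetical; this is not the API proposed for statsmodels.

```python
import numpy as np

def simulate_ar1_panel(n_units=2000, n_periods=6, rho=0.5, burn_in=50, seed=0):
    """Simulate y_it = rho * y_i,t-1 + alpha_i + eps_it on a balanced panel."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n_units)
    y_prev = alpha / (1.0 - rho)                 # start at the unit-specific mean
    for _ in range(burn_in):                     # burn in towards stationarity
        y_prev = rho * y_prev + alpha + rng.normal(size=n_units)
    y = np.empty((n_units, n_periods))
    y[:, 0] = y_prev
    for t in range(1, n_periods):
        y[:, t] = rho * y[:, t - 1] + alpha + rng.normal(size=n_units)
    return y

def within_ar1_slope(y):
    """Within-groups (fixed effects) estimate of rho."""
    dep, lag = y[:, 1:], y[:, :-1]
    dep_d = dep - dep.mean(axis=1, keepdims=True)
    lag_d = lag - lag.mean(axis=1, keepdims=True)
    return float((lag_d * dep_d).sum() / (lag_d ** 2).sum())

def ab_instruments(y_i):
    """Block-diagonal instrument matrix for one unit: the differenced
    equation at time t is instrumented by the levels y_0, ..., y_{t-2}."""
    n_periods = y_i.shape[0]
    n_eq = n_periods - 2
    z = np.zeros((n_eq, n_eq * (n_eq + 1) // 2))
    col = 0
    for row, t in enumerate(range(2, n_periods)):
        z[row, col:col + t - 1] = y_i[:t - 1]
        col += t - 1
    return z

def difference_gmm_one_step(y):
    """One-step difference GMM estimate of rho."""
    n_units, n_periods = y.shape
    dy = np.diff(y, axis=1)
    dep, lag = dy[:, 1:], dy[:, :-1]             # Delta y_t and Delta y_{t-1}
    n_eq = n_periods - 2
    # Covariance pattern of first-differenced white-noise errors.
    h = 2 * np.eye(n_eq) - np.eye(n_eq, k=1) - np.eye(n_eq, k=-1)
    n_inst = n_eq * (n_eq + 1) // 2
    zhz = np.zeros((n_inst, n_inst))
    zx = np.zeros(n_inst)
    zy = np.zeros(n_inst)
    for i in range(n_units):
        z = ab_instruments(y[i])
        zhz += z.T @ h @ z
        zx += z.T @ lag[i]
        zy += z.T @ dep[i]
    w = np.linalg.pinv(zhz)                      # one-step weighting matrix
    return float((zx @ w @ zy) / (zx @ w @ zx))

y = simulate_ar1_panel()
rho_within = within_ar1_slope(y)
rho_gmm = difference_gmm_one_step(y)
print(f"true rho: 0.5, within: {rho_within:.3f}, difference GMM: {rho_gmm:.3f}")
```

With only a handful of time periods the within-groups estimate should fall well below the true coefficient however large N is, whereas the GMM estimate should be close to it. A real implementation would also need to handle additional regressors, unbalanced panels and options to restrict the instrument set.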
I propose to benchmark speed and accuracy of the static panel regression code by comparing to xtreg in Stata and lm in R. Although R offers package plm for estimating panel linear models, the best software for estimating dynamic panel regressions, in terms of speed, robustness and user-friendliness, is Stata’s xtabond2 command (see [5]). Hence xtabond2 would be the preferred benchmarking tool for the dynamic linear panel code.
Timeline
Throughout this timeline, I will communicate regularly with mentors and the statsmodels mailing list, and I will publish bi-weekly blog updates of progress.
Before Week 1 It is expected that merging of the fixed and random effects estimators in PR1133 ([6]) will take place before Week 1 of GSOC 2014. I would contribute to this project to ensure that my contributions during GSOC can easily be merged into statsmodels by the end of the summer. At the same time, I would familiarise myself with the architecture of statsmodels.
Weeks 1-2 Add support for computation of test statistics and p-values associated with (i) the existence of unobserved cross-sectional, time-invariant heterogeneity and (ii) heteroskedasticity and serial correlation in the model error terms.
Weeks 3-4 Add support for a first-difference transform of the model equation. Calculate the first-difference least squares estimator and covariance matrix by applying (robust/non-robust) least squares to the first-differenced equation.
Weeks 5-7 Add support for estimation of dynamic linear panel regression models by linear Generalised Method of Moments, using the Arellano–Bond ‘difference’ estimator ([1]). This would involve careful construction of the instrument matrix, as well as 1-step and 2-step estimators and options to restrict the instrument set.
Week 8 Add support for Sargan/Hansen tests of overidentifying restrictions, including test statistics and p-values to be computed at estimation time.
Weeks 9-11 Extend the above linear Generalised Method of Moments estimation to include the ‘system’ estimator of Blundell–Bond ([3]).
Weeks 12-13 Ensure code is thoroughly tested and documented. Tie up loose ends.
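For the Week 8 item, the Hansen test of overidentifying restrictions can be sketched generically: given each unit's moment contribution evaluated at the two-step estimate, and the two-step weighting matrix, the J statistic is a quadratic form that is compared with a chi-squared distribution with (number of moments minus number of parameters) degrees of freedom. The toy example below uses a cross-sectional instrumental variables model as a stand-in; in the panel case each row of `moments` would instead be a unit's stacked contribution from the differenced equations. The function names and the simulated data are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

def two_step_iv(y, x, z):
    """Two-step GMM for y = beta * x + e with instrument matrix z (n, L)."""
    zx, zy = z.T @ x, z.T @ y
    w1 = np.linalg.pinv(z.T @ z)                  # step 1: 2SLS weighting
    b1 = (zx @ w1 @ zy) / (zx @ w1 @ zx)
    g1 = z * (y - b1 * x)[:, None]                # moment contributions at step 1
    w2 = np.linalg.pinv(g1.T @ g1)                # step 2: efficient weighting
    b2 = (zx @ w2 @ zy) / (zx @ w2 @ zx)
    g2 = z * (y - b2 * x)[:, None]                # moment contributions at step 2
    return float(b2), g2, w2

def hansen_j(moments, weight, n_params):
    """Hansen J statistic, degrees of freedom and p-value."""
    total = moments.sum(axis=0)
    j_stat = float(total @ weight @ total)
    df = moments.shape[1] - n_params
    return j_stat, df, float(stats.chi2.sf(j_stat, df))

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=(n, 3))
u = rng.normal(size=n)
x = z.sum(axis=1) + u + rng.normal(size=n)        # x is endogenous through u
y_valid = 1.0 * x + u                             # all three instruments valid
y_invalid = 1.0 * x + u + 0.5 * z[:, 2]           # third instrument enters the error

_, g_ok, w_ok = two_step_iv(y_valid, x, z)
j_ok, df_ok, p_ok = hansen_j(g_ok, w_ok, n_params=1)
_, g_bad, w_bad = two_step_iv(y_invalid, x, z)
j_bad, df_bad, p_bad = hansen_j(g_bad, w_bad, n_params=1)
print(f"valid instruments:   J = {j_ok:.2f}, df = {df_ok}, p = {p_ok:.3f}")
print(f"invalid instrument:  J = {j_bad:.2f}, df = {df_bad}, p = {p_bad:.3g}")
```

When one instrument is invalid the statistic should be large and the p-value very small. The Sargan statistic reported after one-step estimation has the same quadratic form but relies on a homoskedasticity assumption for its weighting matrix.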
References
[1] Arellano, M. and S. Bond. 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. The Review of Economic Studies. 58(2): 277-297.
[2] Arellano, M. and O. Bover. 1995. Another look at the instrumental variable estimation of error-components models. Journal of Econometrics. 68(1): 29-51.
[3] Blundell, R. and S. Bond. 1998. Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics. 87(1): 115-143.
[4] Pandas home page: http://pandas.pydata.org
[5] Roodman, D. 2009. How to do xtabond2: An introduction to difference and system GMM in Stata. The Stata Journal. 9(1): 86-136.
[6] Statsmodels Work In Progress discussion: https://github.com/statsmodels/statsmodels/pull/1133
[7] Statsmodels home page: http://statsmodels.sourceforge.net
[8] Statsmodels: Generalised Method of Moments http://statsmodels.sourceforge.net/devel/gmm.html
[9] Basic panel data models for statsmodels: https://gist.github.com/vincentarelbundock/5053686
[10] Wilson, S. and D. Butler. 2004. A lot more work to do: the promise and peril of panel data in Political Science. Stanford
[11] Wooldridge, J. 2001. Econometric Analysis of Cross-Section and Panel Data. The MIT Press: Cambridge MA.
Hi Josef, thanks for the comments. I can benchmark against Ox xtabond too, which is supposedly fast. I have the code "DPD for Gauss", which would also be a natural benchmark. I must confess ignorance of GEE, but I can read enough to reference it in the proposal. 'Endogeneity' is the focus of Econometrics and I think the major feature differentiating it from Statistics, presumably because the Social Sciences are plagued with non-experimental data where the explanatory variables and error terms are jointly determined.
Many things could be proposed, but I think you're better placed than me to decide where a limited amount of time should be spent. For example, if you think I should ignore the static models, then more can be done/proposed on the dynamic models. Unless they're made an explicit goal, whether unbalanced panels will be covered is just a matter of how quickly things progress. I'd start with the lag difference estimator, but design code with other transformations in mind (e.g. forward orthogonal deviations).
I'm a PhD candidate in Economics at Oxford under the supervision of Steve Bond. I work on applied panel regression and a bit of econometric theory. I was Kevin's teaching assistant in 2012/2013, so that's how I know him. I've used R extensively for about 7 years, and I've worked in EViews, Stata, Ox, VBA, VB.NET etc. I started Python about 2 years ago and I'm now comfortable in it, having used selenium for scraping websites, nltk for classification & sentiment analysis, and random for simulating data compression algorithms with Monte Carlo (following this research: http://arxiv.org/abs/1304.0353), and having taken edX 6.00.1x and 6.00.2x.
Hi Josef, the proposal is submitted on Melange with sections on GEE, my GitHub account and a link to a pull request. I'm happy to iterate further.
Hello, I am wondering what has happened to this GSOC. I am trying to run a dynamic panel regression using firm-level investment expenditure data. Arellano and Bond seems to be a natural choice, and I haven't found any good Python package for it yet. Thanks!
Oops, I should have linked to this paper (simpler version without the cross-lags):
On Friday, June 23, 2017 at 5:53:26 PM UTC-4, Damien Moore wrote:
I've also been looking for something to handle dynamic panel regression. This paper looks interesting in terms of handling fixed effects models in unbalanced panels with lagged dependent variables:
Is something like the prerequisite, what they call linear structural equation models (SEM), available in statsmodels? (see p. 5)