Driscoll-Kraay standard errors in RegressionResults


Firdaus Janoos

unread,
Aug 7, 2015, 2:10:20 PM8/7/15
to pystatsmodels
Hello,

I have a panel dataset as pandas dataframe organized as:

time | individual |  response | predictor_1 | predictor_2 |
---------------------------------------------------------------------------------
t      |     i        |   y_t,i         |  x1_t,i       |   x2_t,i        |
...
and I've been using statsmodels.regression.linear_model.WLS to estimate $y_{i,t} \sim x1_{i,t} + x2_{i,t}$.

I was interested in using the Driscoll-Kraay method for computing standard errors in panel data with time-series autocorrelation (i.e. the hac-groupsum option of RegressionResults.get_robustcov_results).

However, I am not sure how to set up statsmodels.regression.linear_model.WLS so that the RegressionResults object is aware of the panel structure (i.e. let it know the cross-sectional index and time stamp of each row).

If you can point me to some example code on how to set up this problem, that would be greatly appreciated.

Thanks !







josef...@gmail.com

unread,
Aug 7, 2015, 2:26:18 PM8/7/15
to pystatsmodels
the standard pattern is now to specify it in the `fit` call:

res = model.fit(cov_type='hac-groupsum', cov_kwds={'time': mytime_array, 'groups': mygroup_array})

with the extra arguments passed in the `cov_kwds` dictionary



except, looking for the code, the documentation and the code don't agree: in the code the `if` branch checks for `nw-groupsum`.

I got briefly worried because I didn't find `hac-groupsum` in any test module. It looks like get_robustcov_results has unit tests, but I didn't add them for the newer `fit` interface.

res = model.fit(cov_type='nw-groupsum', cov_kwds={'time': mytime_array, 'groups': mygroup_array})

is supposed to work.

Josef


 









Charles Martineau

unread,
Aug 12, 2015, 4:45:11 PM8/12/15
to pystatsmodels
Dear Josef,

I am also trying to compute the Driscoll-Kraay standard errors, but I always get a MemoryError.

For instance:

index                        Y          X1         X2        ... X17   GroupID
2012-01-25 12:30:00   -1.809030   2.126177   0.522877   ...         1
2012-01-25 12:31:00   -0.434571  -1.809030   2.126177   ...         1
2012-01-25 12:32:00    0.500806  -0.434571  -1.809030   ...         1
2012-01-25 12:33:00   -0.877922   0.500806  -0.434571   ...         1
2012-01-25 12:34:00    0.427819  -0.877922   0.500806   ...         1

The data is 1410 rows by 17 columns. I have four groups, so GroupID goes from 1 to 4.

Now if I try the following:

time = [(t-datetime.datetime(1970,1,1)).total_seconds() for t in df.index]  # convert my time index to number of seconds 
res = sm.OLS(df.Y,  df.X).fit(cov_type='nw-groupsum',  cov_kwds={'time': time, 'groups': np.array(df.GroupID), 'maxlags': 5})

I get this error:

Traceback (most recent call last):

  File "<ipython-input-81-be983d62f538>", line 4, in <module>
    'groups': np.array(dec_all.Pid), 'maxlags':1})

  File "C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 211, in fit
    cov_type=cov_type, cov_kwds=cov_kwds, use_t=use_t)

  File "C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 1099, in __init__
    use_t=use_t, **cov_kwds)

  File "C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 1873, in get_robustcov_results
    use_correction=use_correction)

  File "C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\stats\sandwich_covariance.py", line 871, in cov_nw_groupsum
    S_hac = S_hac_groupsum(xu, time, nlags=nlags, weights_func=weights_func)

  File "C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\stats\sandwich_covariance.py", line 477, in S_hac_groupsum
    x_group_sums = group_sums(x, time).T #TODO: transpose return in grou_sum

  File "C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\stats\sandwich_covariance.py", line 437, in group_sums
    for col in range(x.shape[1])])

MemoryError


What am I doing wrong? Thanks Josef

Charles Martineau

unread,
Aug 12, 2015, 4:45:58 PM8/12/15
to pystatsmodels
Oh, I must add that I regress Y on 16 X variables.

josef...@gmail.com

unread,
Aug 12, 2015, 4:54:39 PM8/12/15
to pystatsmodels
On Wed, Aug 12, 2015 at 4:45 PM, Charles Martineau <martinea...@gmail.com> wrote:
> time = [(t-datetime.datetime(1970,1,1)).total_seconds() for t in df.index]  # convert my time index to number of seconds

What's `time.max()`?
Can you try to convert time to a consecutive integer index corresponding to np.arange(n_time_points)?

A memory error sounds bad. I can look at the code later (tonight or tomorrow).
One guess is that I use np.bincount which creates an array of length time.max(). (We fixed a similar problem in an unrelated part of the code.)

Overall there are a lot of assumptions on the structure of the data and arrays in these parts and not enough checking.

Josef
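Josef's np.bincount guess can be illustrated with a small sketch (the timestamps are the ones from the sample data above; using np.unique with return_inverse=True is one possible way to build the consecutive codes, not necessarily what the thread used):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2012-01-25 12:30', periods=5, freq='min')

# seconds since epoch: values around 1.3 billion, so anything that
# allocates an array of length time.max() (as np.bincount does) blows up
seconds = np.asarray((idx - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s'))
print(seconds.max())   # 1327494840

# map the timestamps to consecutive integer codes instead
time = np.unique(seconds, return_inverse=True)[1]
print(time)            # [0 1 2 3 4]
```

With repeated timestamps across groups, return_inverse assigns the same code to equal timestamps, which is exactly the consecutive time index the estimator expects.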

Charles Martineau

unread,
Aug 12, 2015, 5:30:01 PM8/12/15
to pystatsmodels
Dear Josef,

Yes, you are right: a simple np.arange(n_time_points) fixed the issue.

Thank you