How to use cov_struct.Autoregressive in GEE models

414 views
Skip to first unread message

VincentAB

unread,
Feb 19, 2015, 11:30:04 AM2/19/15
to pystat...@googlegroups.com
Hi all,

I'm trying to estimate a negative binomial GEE model with ar1 covariance structure, but I'm not sure how the "time" argument/matrix/index should constructed.  This is my attempt to replicate results obtained by typing something like this into stata:

xtset unit time
xtgee y x1 x2, family(nbinomial) corr(ar1) robust

where unit is a column of character strings, and time is a column of integers which refer to periods of observation.

Ref: http://www.stata.com/manuals13/xtxtgee.pdf

I read the docs on cov_struct.Autoregressive, but still can't figure out how I convert the unit and time columns of my dataframe to proper input for the "time" argument  of smf.gee. Any hint would be appreciated!

I'll write up an example notebook once I figure out how this works.

Thanks!

Vincent

josef...@gmail.com

unread,
Feb 19, 2015, 11:47:15 AM2/19/15
to pystatsmodels
On Thu, Feb 19, 2015 at 11:30 AM, VincentAB <vincen...@gmail.com> wrote:
> Hi all,
>
> I'm trying to estimate a negative binomial GEE model with ar1 covariance
> structure, but I'm not sure how the "time" argument/matrix/index should
> constructed. This is my attempt to replicate results obtained by typing
> something like this into stata:
>
> xtset unit time
> xtgee y x1 x2, family(nbinomial) corr(ar1) robust
>
> where unit is a column of character strings, and time is a column of
> integers which refer to periods of observation.
>
> Ref: http://www.stata.com/manuals13/xtxtgee.pdf
>
> I read the docs on cov_struct.Autoregressive, but still can't figure out how
> I convert the unit and time columns of my dataframe to proper input for the
> "time" argument of smf.gee. Any hint would be appreciated!

Did you try just integer time indicating the period for each
observation of an individual?
(stack all time indices with a dtype where calculations make sense)


AR is defined for continuous time on the distance t_ij - t_i'j for individual j

If time is not specified, it defaults to np.arange(nobs_i)

Josef

Vincent Arel

unread,
Feb 19, 2015, 1:33:25 PM2/19/15
to pystat...@googlegroups.com
Sorry, I think I'm having a bad day; still can't figure it out. Let's try with a reproducible example:

import pandas as pd
import statsmodels.formula.api as smf
url = 'http://vincentarelbundock.github.io/Rdatasets/csv/plm/EmplUK.csv'
dat = pd.read_csv(url)

fam = sm.families.NegativeBinomial()
ar1 = sm.cov_struct.Autoregressive()
mod = smf.gee('wage ~ capital + emp', groups=dat.firm, data=dat,
        cov_struct=ar1, family=fam)
mod.fit()

This produces a "ValueError: Autoregressive: unable to find right bracket"

My assumption is that this is a problem with my specification of "groups" in the smf.gee, or "time" in sm.GEE. Right?

Vincent

josef...@gmail.com

unread,
Feb 19, 2015, 1:53:17 PM2/19/15
to pystatsmodels
The model looks correctly defined to me.

you have 140 firms, which are internally split into lists of arrays

>>> len(mod.exog_li)
140
>>> mod.endog_li[0]
array([ 13.1516, 12.3018, 12.8395, 13.8039, 14.2897, 14.8681, 13.7784])
>>> dat[:8]
Unnamed: 0 firm year sector emp wage capital output
0 1 1 1977 7 5.041 13.1516 0.5894 95.707199
1 2 1 1978 7 5.600 12.3018 0.6318 97.356903
2 3 1 1979 7 5.015 12.8395 0.6771 99.608299
3 4 1 1980 7 4.715 13.8039 0.6171 100.550100
4 5 1 1981 7 4.093 14.2897 0.5076 99.558098
5 6 1 1982 7 3.166 14.8681 0.4229 98.615097
6 7 1 1983 7 2.936 13.7784 0.3920 100.030100
7 8 2 1977 7 71.319 14.7909 16.9363 95.707199
>>>


The problem is somewhere in the updating of the AR parameter, not with
the model AFAICS

Josef


>
> Vincent
>

Vincent Arel

unread,
Feb 19, 2015, 2:01:24 PM2/19/15
to pystat...@googlegroups.com
Good to know that the syntax is OK and that the model seems properly set up. FWIW, I get the same error when using my own data. Was I just unlucky to find two "problematic datasets" (EmplUK and my own), or do you think I should be looking for a bug in cov_struct.Autoregressive.update?

vab

josef...@gmail.com

unread,
Feb 19, 2015, 2:08:11 PM2/19/15
to pystatsmodels
I was looking at it, it uses brentq for finding the ar parameter, but
cannot find a bounding bracket.
My computer just crashed a short time ago, and I haven't managed to
look into it with pdb yet.

One possibility is that there is some stationarity assumption built
in, but only Kerby knows the details.
`if b_rgt > 1. - 1e-6: raise` but I haven't figured out yet what the
different names mean.

Josef


>
> vab

Vincent Arel

unread,
Feb 19, 2015, 2:32:18 PM2/19/15
to pystat...@googlegroups.com
Thanks for looking at it. I tried a few other datasets (some with slightly more observations per unit), and I hit another error related to the Brent bracket: ValueError("Not a bracketing interval.")

I'm afraid I don't know enough about Kerby's work on this to diagnose. Too bad...

Thanks again!

vab

josef...@gmail.com

unread,
Feb 19, 2015, 2:36:50 PM2/19/15
to pystatsmodels
Blue screens are back, and I will have to do some maintenance (and wait for my new computer to arrive)

Before the crash, I was trying to add a trending variable, year, to the regressors so the residuals are more stationary, but it didn't help.

Josef
A tablet is not a computer for working.

josef...@gmail.com

unread,
Feb 20, 2015, 7:30:06 AM2/20/15
to pystatsmodels
stationarity is assumed by construction and the variance calculations.
Here ar is just a correlation pattern
It's irrelevant for the problem here.

In this example, I switched to Poisson,
the bracket search doesn't find the area where the minimum is, the
valley is "too short"

I get good estimates with just the simplest bracket
self.dep_params = brent(fitfunc, brack=(0,1))

>>> mod.cov_struct.dep_params
0.92359832460800517

I also tried zscoring the exog, because initially I got huge numbers
for the standardized residuals (with unscaled year), but it didn't
have an effect for finding the autocorrelation.

We need an open issue and rewrite or safeguard the bracketing code for AR.

Josef

Kerby Shedden

unread,
Feb 20, 2015, 8:43:19 AM2/20/15
to pystat...@googlegroups.com
The AR code is sort of a home-brew thing I did to accommodate unequal spacing and multiple dimensions.  At the time I didn't have a good reference for the standard gridded AR, but I now know that it is in the stata docs:


I will put together a PR for this that uses the gridded approach when possible, and safeguards the bracket search for the other approach.

Kerby

Kerby Shedden

unread,
Feb 20, 2015, 9:32:35 AM2/20/15
to pystat...@googlegroups.com
If you use plot_isotropic_dependence on the firms data (fit using independence or exchangeable cov_struct) you see a U-shaped pattern in the empirical autocorrelations.

The AR method we currently have is an AR^1 which assumes exponential decay in pairwise correlations over time.  This particular type of AR is not a good fit for this data.

m jadidi

unread,
Jun 7, 2016, 12:46:42 PM6/7/16
to pystatsmodels
any solution/update for this issue? I'm encountering the same problem,

Kerby Shedden

unread,
Jun 8, 2016, 8:55:04 AM6/8/16
to pystatsmodels
If you use `plot_isotropic_dependence` does it look like an AR-1 model should fit?

Are your time points equally spaced?
Reply all
Reply to author
Forward
0 new messages