GEE with Autoregressive dependence structure

m jadidi

Jul 24, 2019, 10:45:34 AM

to pystatsmodels

Hey all,

My question is very similar to the one here. My panel is unbalanced: the number of observations varies among subjects, and there are many dropouts.

year subject_count
1     42736
2     27146
3     23328
4     19795
5     16898
6     14061
7     11530
8      9155
9      7141
10     5260
11     3770
12     2524
13     1727
14     1140
15      771
16      582
17      374
18      166
19       57
20       11
21        6
22        5
23        4
24        2
25        5
26        1

I want to use an AR model to account for the autocorrelation between observations. From the other post I understand that I can't use AR when the time points are not equally spaced. Is that right? ("ValueError: Autoregressive: unable to find right bracket")
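To make the spacing question concrete: an AR(1)-type working correlation can be written in terms of time gaps rather than positions, corr(y_s, y_t) = rho ** |s - t|. The sketch below uses plain NumPy with a hypothetical helper name (`ar1_working_corr`) and made-up values for rho and the observation times. It only illustrates the idea and is not statsmodels' internal code.

```python
import numpy as np

def ar1_working_corr(times, rho):
    """AR(1)-style correlation matrix: corr[i, j] = rho ** |t_i - t_j|."""
    times = np.asarray(times, dtype=float)
    gaps = np.abs(times[:, None] - times[None, :])
    return rho ** gaps

# One subject observed in career years 1, 2, 5 and 6 (illustrative values).
corr = ar1_working_corr([1, 2, 5, 6], rho=0.5)
print(np.round(corr, 3))
```

With this definition, the two observations separated by the three-year gap get a weaker correlation than the adjacent ones, so unequal spacing is handled by the distance itself.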

I was thinking of including a variable that accounts for the lag between consecutive observations, for example y_(t-1) as a predictor of y_t. Would that work? Alternatively, I could apply a filter such as the Baxter-King filter (statsmodels.tsa.filters.bk_filter.bkfilter) or the Hodrick-Prescott filter (statsmodels.tsa.filters.hp_filter.hpfilter) to remove the moving-average trend for each group, and then use the Exchangeable() structure.
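A minimal sketch of the lagged-outcome idea, assuming long-format data with one row per artist and year. The column names mirror the ones in this post, the values are made up, and `travel_dist_lag1` and `gap_years` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Toy long-format panel; values are invented for illustration.
df = pd.DataFrame({
    "artist_name": ["a", "a", "a", "b", "b"],
    "career_age": [1, 2, 4, 1, 2],
    "travel_dist": [0.1, 0.3, 0.2, -0.5, -0.4],
})

# Sort so that shift/diff operate in time order within each artist.
df = df.sort_values(["artist_name", "career_age"])
grouped = df.groupby("artist_name")

# Previous observed outcome within each artist (NaN for the first record).
df["travel_dist_lag1"] = grouped["travel_dist"].shift(1)
# Years elapsed since that previous observation, so unequal gaps stay visible.
df["gap_years"] = grouped["career_age"].diff()

print(df)
```

Two caveats with this approach: the first record of every artist has no lag and would be dropped with missing='drop', and adding y_(t-1) as a regressor turns the marginal model into a transition (conditional) model, so the coefficients are interpreted differently.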

Is it possible to trick the Autoregressive function by writing my own dist_func?

This is what my data currently look like:

         artist_name  year  closeness_event  kcore_event  betweenness_event  \
0         colin dale  1993         0.575553    -0.848016          -0.078891   
1         jeff mills  1993         0.581378    -0.849989          -0.070038   
2       paul van dyk  1993         0.571694    -0.856506          -0.162971   
3      robert armani  1996        -3.607296    -0.845350          65.450321   
4  claudio coccoluto  1998        -3.319848    -0.865555          -0.127938   

   clustering_coeff_event  career_age  release_count  travel_dist  \
0                1.183177           1      -0.569684    -0.383515   
1                1.176884           4       0.072075    -0.379455   
2                1.188049           5      -0.304109    -0.385076   
3                1.172376           6      -0.568790    -0.382209   
4               -2.937427           7       0.085921    -0.381167   

   past_success  decade  
0     -0.339579     1.0  
1      6.717467     1.0  
2      6.695121     1.0  
3     -0.353896     1.0  
4     -0.026755     2.0

Here are the model and the results when I use a Gaussian family and an Exchangeable structure:

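from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Gaussian
from statsmodels.genmod.cov_struct import Exchangeable
formula = "travel_dist ~ C(decade, Treatment(reference=1)) + closeness_event + kcore_event + betweenness_event + clustering_coeff_event + career_age + release_count + past_success"
mod = GEE.from_formula(formula, groups="artist_name", data=df, time="career_age", family=Gaussian(), cov_struct=Exchangeable(), missing="drop")  # groups is passed only once
result = mod.fit()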

                               GEE Regression Results                              
===================================================================================
Dep. Variable:                 travel_dist   No. Observations:               169101
Model:                                 GEE   No. clusters:                    21020
Method:                        Generalized   Min. cluster size:                   5
                      Estimating Equations   Max. cluster size:                  19
Family:                           Gaussian   Mean cluster size:                 8.0
Dependence structure:         Exchangeable   Num. iterations:                     3
Date:                     Wed, 24 Jul 2019   Scale:                           0.073
Covariance type:                    robust   Time:                         16:27:48
============================================================================================================
                                               coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
Intercept                                    0.3460      0.108      3.214      0.001       0.135       0.557
C(decade, Treatment(reference=1))[T.2.0]    -0.1880      0.108     -1.741      0.082      -0.400       0.024
C(decade, Treatment(reference=1))[T.3.0]    -0.2938      0.108     -2.725      0.006      -0.505      -0.082
C(decade, Treatment(reference=1))[T.4.0]    -0.3270      0.108     -3.035      0.002      -0.538      -0.116
C(decade, Treatment(reference=1))[T.5.0]    -0.3488      0.108     -3.237      0.001      -0.560      -0.138
closeness_event                              0.0002      0.001      0.295      0.768      -0.001       0.002
kcore_event                                 -0.0006      0.001     -0.672      0.502      -0.002       0.001
betweenness_event                           -0.0033      0.001     -3.980      0.000      -0.005      -0.002
clustering_coeff_event                      -0.0015      0.001     -2.178      0.029      -0.003      -0.000
career_age                                  -0.0061      0.000    -13.936      0.000      -0.007      -0.005
release_count                                0.0249      0.002     11.276      0.000       0.021       0.029
past_success                                 1.0261      0.003    331.533      0.000       1.020       1.032
==============================================================================
Skew:                         -2.4159   Kurtosis:                      39.7054
Centered skew:                -0.6141   Centered kurtosis:             34.5069
==============================================================================

result.cov_struct.dep_params

0.3469744871809727

fig = result.plot_isotropic_dependence()
plt.grid(True)

I am very confused and would appreciate any tips!

cheers,

Mohsen
