Looping through regressions


Robert Garrison II

Oct 19, 2016, 8:10:25 PM
to pystatsmodels
I understand that this may be an elementary question, but bear with me.

I have a 16M record dataframe, and I want to perform regressions on 60 subsets of the dataframe defined by Variable_A.  I created a list of the subsets (e.g. subsets=list(set(df['Variable_A'])) ), that I want to iterate through.

I set up the following regression loop:

for i in subsets:
     i=smf.ols('response ~ C(variable_1) + C(variable_2) +...+C(variable_N)',data=df.loc[df['Variable_A']==str(i)]).fit()

Variable_A is not a regressor in the model.  When I attempt to obtain the summary from a particular regression (e.g. males.summary(), assuming 'males' is one of the values in Variable_A), Python states that 'males' is not defined.  I want a regression result for each element in subsets.  What would be the best way to resolve this issue?  Should I create a separate variable outside of the loop, and if so, what type?  I know that the output is a RegressionResults class instance.

josef...@gmail.com

Oct 19, 2016, 10:15:11 PM
to pystatsmodels


On Wed, Oct 19, 2016 at 8:10 PM, Robert Garrison II <garrison...@gmail.com> wrote:
> I understand that this may be an elementary question, but bear with me.
>
> I have a 16M record dataframe, and I want to perform regressions on 60
> subsets of the dataframe defined by Variable_A.  I created a list of the
> subsets (e.g. subsets=list(set(df['Variable_A'])) ), that I want to iterate
> through.
>
> I set up the following regression loop:
>
> for i in subsets:
>      i=smf.ols('response ~ C(variable_1) + C(variable_2) +...+C(variable_N)',data=df.loc[df['Variable_A']==str(i)]).fit()

`i = ...` assigns the fitted results to the loop variable `i`, which is overwritten with the next subset label on the next pass through the loop. No variable named after the subset (such as `males`) is ever created.
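The rebinding can be seen in a few lines of plain Python. This is only a sketch: the subset labels are hypothetical and a string stands in for the fitted model.

```python
subsets = ["males", "females"]  # hypothetical subset labels

for i in subsets:
    # Stand-in for smf.ols(...).fit(); any assignment to `i` behaves the same way.
    i = "fit for " + i

print(i)                     # only the value from the final pass is left
print("males" in globals())  # checks whether a variable named `males` exists
```

After the loop, `i` holds only the last assignment, and the earlier results are no longer reachable by any name.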

One standard Python way is to use a dict keyed by `i`, e.g.:

res_all = {}

# one fitted result per subset, keyed by the subset label
for i in subsets:
     res_all[i] = smf.ols('response ~ C(variable_1) + C(variable_2) +... + C(variable_N)', data=df.loc[df['Variable_A']==str(i)]).fit()

print(res_all["males"].summary())
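For anyone who wants to try the dict approach without the original data, here is a self-contained sketch. The synthetic DataFrame, the two-regressor formula, and the use of `unique()` to list the subsets are assumptions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration only; the column names mirror the thread.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Variable_A": rng.choice(["males", "females"], size=n),
    "variable_1": rng.choice(["a", "b", "c"], size=n),
    "variable_2": rng.choice(["x", "y"], size=n),
})
df["response"] = rng.normal(size=n)

# Distinct subset labels taken from the grouping column.
subsets = df["Variable_A"].unique()

# One fitted result per subset, keyed by the subset label.
res_all = {}
for key in subsets:
    subset_df = df.loc[df["Variable_A"] == key]
    res_all[key] = smf.ols(
        "response ~ C(variable_1) + C(variable_2)", data=subset_df
    ).fit()

print(res_all["males"].summary())
```

Each value in `res_all` is a regression results object, so attributes such as `params` and `rsquared` are available per subset as well.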

If the order matters, a `collections.OrderedDict` or a list would be more appropriate.

I often use a list of tuples [(name, result)] if I only want to process the results in sequence.
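A minimal sketch of the list-of-tuples variant, again on made-up data. Using `groupby` to split the DataFrame is an assumption swapped in here for the boolean mask; it yields (label, sub-frame) pairs in sorted label order by default, which gives the list a predictable sequence. Note also that plain dicts preserve insertion order in Python 3.7 and later.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration only; the column names mirror the thread.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "Variable_A": rng.choice(["males", "females", "unknown"], size=n),
    "variable_1": rng.choice(["a", "b"], size=n),
    "response": rng.normal(size=n),
})

# groupby splits the frame once and yields (label, sub-frame) pairs.
results = [
    (name, smf.ols("response ~ C(variable_1)", data=group).fit())
    for name, group in df.groupby("Variable_A")
]

# Process the fits in sequence, one line per subset.
for name, res in results:
    print(name, int(res.nobs), round(res.rsquared, 4))
```

With many subsets on a large DataFrame, splitting once with `groupby` may also be cheaper than building a boolean mask per subset, though that depends on the data.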

Josef