Outreg / Outreg2 functionality for statsmodels


Benjamin Kay

May 1, 2016, 20:28:26
to pystatsmodels

Outreg and outreg2 are Stata modules which allow you to combine several regressions into a single table. The output looks something like this (if you choose the latex option):

The modules provide functionality that allows customized fields, critical values, and output formats (doc, excel, tex). I'm primarily interested in output to tex, but it seems that once the data is in a table or DataFrame it is pretty straightforward to format things.

For a single regression one can use the as_latex function (something like res.summary().as_latex()) to output a single regression result like this:

As far as I know there is no equivalent in Statsmodels for combining multiple regression results like the first figure.


Inspired by an Economics Stack Exchange question (Outputting Regressions as Table in Python (similar to outreg in stata)?), I would like to help solve this problem. I've never contributed to a project like this, or even to an application used by anyone other than my research collaborators, so I'm not sure my coding is quite up to the challenge. However, I do have some basic code which is functional. I'd like to share that code and get feedback on how I might make it robustly functional, compliant with statsmodels standards, and ready for inclusion more broadly. Here is my code:


import pandas as pd
import statsmodels.formula.api as smf

def makecoefftable(regresult):
    # One row per regressor: estimate, HC0 (robust) standard error, test statistic, p-value.
    # Note: tvalues and pvalues come from the fit's default covariance, not from HC0.
    d = pd.concat([regresult.params, regresult.HC0_se, regresult.tvalues, regresult.pvalues], axis=1)
    df = pd.DataFrame(d)
    df.columns = ["Betas", "Std. Errors", "Z Scores", "P Values"]
    # Significance stars; include_lowest=True so a p-value of exactly 0 still gets "***".
    df['asterisks'] = pd.cut(df["P Values"], [0, 0.01, 0.05, 0.1, 1], include_lowest=True, labels=["***", "**", "*", ""])
    # tablefoot1 = "Standard errors in parentheses"
    # tablefoot2 = "*** p<0.01, ** p<0.05, * p<0.1"
    outputformatfncsign = lambda x: "{:+.3f}".format(x)
    outputformatfncnosign = lambda x: "{:.3f}".format(abs(x))
    outputformatfncnosignwparens = lambda x: "(" + outputformatfncnosign(x) + ")"
    df['BetaswStars'] = df['Betas'].apply(outputformatfncsign) + df["asterisks"].astype(str)
    # Pad with spaces so that every starred coefficient has the same width.
    df['StarPadding'] = (df['BetaswStars'].map(len).max() - df['BetaswStars'].map(len)).values
    df['StarPadding'] = df['StarPadding'].apply(lambda x: x*" ")
    # df['BetaswStars'] = df['Betas'].apply(lambda x: int(x>=0)*" ") + df['BetaswStars'] + df['StarPadding']
    df['BetaswStars'] = df['BetaswStars'] + df['StarPadding']
    del df['StarPadding']
    df['Betas'] = df['BetaswStars']
    del df['BetaswStars']
    del df['asterisks']
    df['Std. Errors'] = df['Std. Errors'].apply(outputformatfncnosignwparens)
    df["Z Scores"] = df["Z Scores"].apply(outputformatfncsign)
    df["P Values"] = df["P Values"].apply(outputformatfncnosign)
    # Stack coefficients and standard errors into long format. The stable sort keeps
    # each coefficient row ahead of its standard-error row.
    df2 = pd.melt(df.reset_index()[["index", "Betas", "Std. Errors"]], id_vars=["index"]).sort_values(by=["index"], kind="mergesort")
    df2["fieldtype"] = "coef"
    df2.loc[df2.index.max() + 1, :] = ["Observations", "Betas", int(regresult.nobs), "stats"]
    df2.loc[df2.index.max() + 1, :] = ["R-Squared", "Betas", outputformatfncnosign(regresult.rsquared), "stats"]
    # df2["fieldtype"]=df2["variable"]
    # Show the variable name on the coefficient row only; leave it blank on the standard-error row.
    df2["Variable"] = df2["index"].where(df2["variable"] == "Betas", "")
    del df2["variable"]
    df2 = df2[["index", "Variable", "fieldtype", "value"]]
    return df2

def makeoutregtable(listofregresults):
    if len(listofregresults) == 0:
        raise ValueError("need at least one regression result")
    keys = ["index", "Variable", "fieldtype"]
    dfoutput = None
    for i, result in enumerate(listofregresults, start=1):
        # Label the columns "(1)", "(2)", ... outreg-style before merging, so that
        # merge never has to invent clashing value_x / value_y suffixes.
        dftmp = makecoefftable(result).rename(columns={"value": "({})".format(i)})
        if dfoutput is None:
            dfoutput = dftmp
        else:
            dfoutput = dfoutput.merge(right=dftmp, how="outer", on=keys)
    dfoutput = dfoutput.fillna("")
    # Variable is blank on standard-error rows, so sorting it in descending order
    # keeps each coefficient directly above its standard error.
    dfoutput = dfoutput.sort_values(["fieldtype", "index", "Variable"], ascending=[True, True, False])
    return dfoutput


def finishoutregtable(dfoutput):
    dfformatted = dfoutput.copy()
    del dfformatted["index"]
    del dfformatted["fieldtype"]
    tablefoot1 = "Standard errors in parentheses"
    tablefoot2 = '*** p$<$0.01, ** p$<$0.05, * p$<$0.1'
    # Append the two footnotes as extra rows at the bottom of the table.
    dfformatted.loc[dfformatted.index.max() + 1, "Variable"] = tablefoot1
    dfformatted.loc[dfformatted.index.max() + 1, "Variable"] = tablefoot2
    dfformatted = dfformatted.fillna("")
    return dfformatted

def writelatexdocfromdf(df):
    beginningtex = ("\\documentclass{report}\n"
                    "\\usepackage{booktabs}\n"
                    "\\begin{document}")
    endtex = "\\end{document}"

    # Earlier version that wrote straight to a file:
    # f = open(filename, 'w')
    # f.write(beginningtex)
    # f.write(df.to_latex(escape=False))
    # f.write(endtex)
    # f.close()

    # escape=False keeps the $<$ in the footnote as LaTeX math.
    textable = beginningtex + '\n' + df.set_index("Variable").to_latex(escape=False) + '\n' + endtex
    return textable

Here is a simple example that makes use of the code:


x = [1, 3, 5, 6, 8, 3, 4, 5, 1, 3, 5, 6, 8, 3, 4, 5, 0, 1, 0, 1, 1, 4, 5, 7]
y = [0, 1, 0, 1, 1, 4, 5, 7, 0, 1, 0, 1, 1, 4, 5, 7, 0, 1, 0, 1, 1, 4, 5, 7]
d = {"x": pd.Series(x), "y": pd.Series(y)}
df = pd.DataFrame(d)
df['xsqr'] = df['x']**2
mod = smf.ols('y ~ x', data=df)
res = mod.fit()
print(res.summary())
df['xcube'] = df['x']**3

mod2 = smf.ols('y ~ x + xsqr', data=df)
res2 = mod2.fit()
print(res2.summary())

mod3 = smf.ols('y ~ x + xsqr + xcube', data=df)
res3 = mod3.fit()
print(res3.summary())

reslistlong = [res, res2, res3]
print(makeoutregtable(reslistlong))
# The with-block closes the file even if writing fails.
with open("myregs.tex", 'w') as f:
    f.write(writelatexdocfromdf(finishoutregtable(makeoutregtable(reslistlong))))

The resulting tex file compiles to the following table:

The heavy usage of outreg in the Stata community suggests this would be a much-used feature if included as part of statsmodels. Hopefully my code is useful, and with help and advice it could be expanded into something more fully featured, but at a minimum it can serve as a proof of concept. Please let me know how I can help.

Thanks,

Benjamin

josef...@gmail.com

May 2, 2016, 12:31:27
to pystatsmodels
Looks very good, I've seen several stackoverflow or stackexchange questions for this.

Actually, we do have something similar, but it has never been advertised, and I have never looked very closely at its usage and options.

One bonus feature that would be nice to have is to add t_test results if there are combined effects, e.g. a joint test that b1 x + b2 x**2 = 0 by testing [b1=0, b2=0]. t_test provides the same information as the params table.
I wrote `wald_test_terms` partly because Stock and Watson's undergraduate econometrics textbook includes such tests in tables like this, and I think it is very useful information.
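To make the joint test concrete, here is a minimal sketch in plain Python of the chi-square form of the Wald statistic for the hypothesis [b1=0, b2=0]. It is independent of statsmodels; the function name `wald_joint_zero` and the numbers are made up for illustration, and a real table would take the estimates and covariance from a fitted result.

```python
import math


def wald_joint_zero(b, cov):
    """Wald statistic and p-value for H0: b[0] = 0 and b[1] = 0.

    b   : the two coefficient estimates
    cov : their 2x2 covariance matrix as nested lists
    """
    (v11, v12), (v21, v22) = cov
    det = v11 * v22 - v12 * v21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Quadratic form b' V^{-1} b, using the closed-form inverse of a 2x2 matrix.
    stat = (v22 * b[0] ** 2 - (v12 + v21) * b[0] * b[1] + v11 * b[1] ** 2) / det
    # With two restrictions the chi-square survival function is exp(-x / 2).
    pvalue = math.exp(-stat / 2.0)
    return stat, pvalue


# Made-up estimates and covariance, for illustration only.
stat, pvalue = wald_joint_zero([1.0, 2.0], [[1.0, 0.0], [0.0, 4.0]])
print("Wald chi2(2) = {:.3f}, p = {:.3f}".format(stat, pvalue))
```

A row like this could sit below the coefficient rows of a combined table, next to Observations and R-Squared.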


I know that outreg and similar are very popular, but I never used or looked at it. So, I don't know what format and options are useful and feasible.

After writing the above, I spent a bit of time browsing the documentation of outreg2, estout, and apsrtable (the R equivalent). They all have a large number of options. outreg2 also supports other tables besides comparative regression summaries. Some of those use cases are handled, or would be better handled, by pandas directly.
There is also an R port of outreg2 (GPL licensed) that is almost 1400 lines (I didn't look at the code itself because of the GPL).


Some thoughts (just ideas, given that I have not been a user of those functions):

It would be good to have something simple for standardized tables, like your function and summary_col.
It would be good to have something that can be expanded over time to offer options similar to those of other packages.

We have SimpleTable that we use in results.summary() which has automatic conversion to latex and html, but is restricted to rectangular tables (each row has same number of columns)

class based implementation?
outreg2 mentions that it keeps in memory the options that have already been set, i.e. as global state. We don't want any global state, so the corresponding object for us would be a class.
I think Pierre's idea for the usage of kde would be useful in this case: when we don't know in advance exactly how the table should look, or when we want different versions of the table, it is useful to just change some attributes or call some setter methods while keeping all other options as before. This would avoid copying long statements with a large number of arguments specified in the function call.
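A rough sketch of what such a class could look like, in plain Python. Everything here (the class name `RegressionTable`, its attributes and methods) is hypothetical and not part of statsmodels. It only illustrates keeping options on an instance instead of in global state, so a second version of the table needs one attribute change rather than a repeated long function call.

```python
class RegressionTable:
    """Hypothetical builder; every name here is illustrative, not statsmodels API."""

    def __init__(self, float_format="{:.3f}", stars=True):
        # Options live on the instance, so two tables never share settings.
        self.float_format = float_format
        self.stars = stars
        self.columns = []  # list of (label, {name: (coef, se, pvalue)})

    def add_model(self, label, estimates):
        self.columns.append((label, dict(estimates)))
        return self  # returning self allows chained calls

    def _stars(self, pvalue):
        if not self.stars:
            return ""
        for cutoff, mark in ((0.01, "***"), (0.05, "**"), (0.1, "*")):
            if pvalue < cutoff:
                return mark
        return ""

    def rows(self):
        """Header row, then a coefficient row and a standard-error row per variable."""
        names = []
        for _, est in self.columns:
            for name in est:
                if name not in names:
                    names.append(name)
        out = [["Variable"] + [label for label, _ in self.columns]]
        for name in names:
            coef_row, se_row = [name], [""]
            for _, est in self.columns:
                if name in est:
                    coef, se, p = est[name]
                    coef_row.append(self.float_format.format(coef) + self._stars(p))
                    se_row.append("(" + self.float_format.format(se) + ")")
                else:
                    # This model does not include the variable: leave the cells blank.
                    coef_row.append("")
                    se_row.append("")
            out.append(coef_row)
            out.append(se_row)
        return out


# Made-up estimates: (coefficient, standard error, p-value).
table = RegressionTable()
table.add_model("(1)", {"x": (0.5, 0.1, 0.001)})
table.add_model("(2)", {"x": (0.4, 0.2, 0.06), "xsqr": (0.03, 0.05, 0.5)})
first = table.rows()

# A second version of the same table: change two options, keep everything else.
table.stars = False
table.float_format = "{:.2f}"
second = table.rows()
for row in second:
    print(row)
```

The rows are plain lists of strings, so they could be handed to pandas or to a text, LaTeX, or HTML writer afterwards.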

pandas inside plus whatever nicely formatted output we need

Aside: most of our results classes have two implementations of summary, `summary` and `summary2`. `summary` is very restrictive but fine-tuned for fixed-width font text (according to my tastes). `summary2` is a lot more flexible, uses an underlying pandas DataFrame, and (at least theoretically) allows wider choices of numerical formatting.
One problem with pandas DataFrame is or was that the restrictions on the index affect the possible output, IIRC, and currently we handle "pretty" table formatting and conversion mostly through SimpleTable.
(I didn't pay attention to any details about how output options for pandas DataFrames have changed in recent years.)


style sheets could be made available as predefined option dictionaries.
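As a small illustration of that idea, predefined styles could be plain dictionaries that a call copies and then overrides. The style names and option keys below are invented for this sketch.

```python
# Hypothetical presets; the style names and option keys are illustrative only.
STYLES = {
    "outreg": {"float_format": "{:.3f}", "stars": True, "se_in_parens": True},
    "plain": {"float_format": "{:.4f}", "stars": False, "se_in_parens": False},
}


def resolve_options(style="outreg", **overrides):
    """Start from a named preset and apply per-call overrides on a copy."""
    if style not in STYLES:
        raise KeyError("unknown style: {!r}".format(style))
    options = dict(STYLES[style])  # copy, so the preset itself is never mutated
    unknown = set(overrides) - set(options)
    if unknown:
        raise TypeError("unknown option(s): {}".format(", ".join(sorted(unknown))))
    options.update(overrides)
    return options


opts = resolve_options("outreg", float_format="{:.2f}")
print(opts)
```

Rejecting unknown keys catches typos in option names early, which matters once the number of options grows.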

compatibility across models: 
Regression models are the most common use case. However, once we include other models, we will need some compatibility layer for things that are not equivalently available. The params table and many other attributes are standardized across models, but things like R-squared will not be available across models; some models have a pseudo R-squared, and some don't have anything comparable.
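One possible shape for such a compatibility layer is a helper that looks for the best available fit statistic and reports nothing when there is none. The sketch below uses stand-in objects rather than fitted models. The helper name `fit_statistic` is made up, and the attribute names `rsquared` and `prsquared` are my assumption about what the results classes expose.

```python
from types import SimpleNamespace


def fit_statistic(result):
    """Return (label, formatted value) for the best available fit measure.

    Tries `rsquared` first, then `prsquared` (assumed attribute names).
    Returns None when the result has nothing comparable.
    """
    candidates = (("rsquared", "R-squared"), ("prsquared", "Pseudo R-squared"))
    for attr, label in candidates:
        value = getattr(result, attr, None)
        if value is not None:
            return label, "{:.3f}".format(value)
    return None


# Stand-ins for fitted results, with made-up numbers.
ols_like = SimpleNamespace(rsquared=0.25)
logit_like = SimpleNamespace(prsquared=0.1)
other = SimpleNamespace()

for res in (ols_like, logit_like, other):
    print(fit_statistic(res))
```

A table builder could then leave the cell blank, or drop the row, for models where the helper returns None.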



I would be very interested in helping to get something like this into statsmodels, but I won't have time to work much on it myself.

Thanks,

Josef
















