Getting Stata results into Python [Was Re: how are we doing?]


Skipper Seabold

May 2, 2012, 2:14:04 PM
to pystat...@googlegroups.com
On Wed, May 2, 2012 at 1:53 PM, <josef...@gmail.com> wrote:
> On Wed, May 2, 2012 at 1:09 PM, Skipper Seabold <jsse...@gmail.com> wrote:
>> Looks like we need to put a documentation push on our plates. This is
>> feasible for me in the medium-term future.
>>
>> "The kdensity function, in statsmodels.nonparametric.kde provides that
>> and other kernels, but given the state of statsmodels' documentation,
>> you would probably only find this function by accident. It’s also
>> substantially slower than gaussian_kde on large data... Statsmodels,
>> generally, seems to have a lot of undocumented functionality; not
>> surprising for a young, rapidly-expanding project."
>
> I pretty much agree (outside the main models). I sometimes don't know
> where to look for a function.
> (Where did I put the sandwiches, again?)
>
> I got STATA this week and have been playing for a bit. I like their
> menus, especially for Statistics. However, I haven't found a printable
> version of the menu tree yet.
>
> Some things are missing in the main toc tree, like the nonparametric kde.
> For some parts we might need more overview pages.
>
> I was reading today
> http://statsmodels.sourceforge.net/devel/diagnostic.html#diagnostics
> and it looks useful to me, but needs more proof-reading.
>
> I'm trying to figure out how to write tests based on STATA results, so
> I can clean up some models and unhide them or make them "official":
> ZeroInflatedPoisson, HetGLS (STATA uses ARCH/GARCH with MLE for this),
> IV2SLS is also correct, ...
>

Prepare to have your mind blown. Drop
statsmodels/tools/mat2nparray.ado somewhere on your ado path (pushing
my updated version now that we have other Stata users). Type adopath
in Stata to see where it goes; you'll most likely want the
ado/personal/ directory. You can use it like this:

webuse auto
reg mpg weight foreign
predict yhat
mkmat yhat
mat2nparray yhat, saving("/home/skipper/scratch/yhat.py") format("%16.0g")

predict resid, res
mkmat resid

/* see what else we can get from results */
eret list

mat F = e(F)

mat2nparray yhat resid F, saving("/home/skipper/scratch/results.py") ///
    format("%16.0g") replace

The big idea is that you can only save matrices (even if these won't
fit in Stata's limited matrix memory size, which is nice). So you've
got to use mkmat to turn Stata variables into matrices and the mat
command to get ereturn results.
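The point of the export is the Python side: the generated results.py is just a module you import and compare against. A minimal sketch of such a test (the generated file is inlined as a string here for illustration; the F value is the one from the session below):

```python
import numpy as np
from numpy.testing import assert_allclose

# Stand-in for the module mat2nparray writes; in a real test you would
# simply `import results` from the generated results.py.
generated = """
import numpy as np
F = np.array([69.74846262])
results = {'F': F}
"""
ns = {}
exec(generated, ns)

# In a real test this value would come from a statsmodels fit.
our_F = np.array([69.74846262])
assert_allclose(our_F, ns["results"]["F"], rtol=1e-8)
```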

I now have two files that are importable in Python (!).

[~/scratch/]
[1]: import yhat, results

[~/scratch/]
[2]: yhat.yhat[:10]
[2]:
array([ 22.37719536, 19.6102829 , 24.28768158, 20.26907158,
14.80112553, 17.50215912, 26.98871613, 20.07143593,
16.11870384, 19.28088951])

[~/scratch/]
[3]: yhat.results.keys()
[3]: ['yhat']

[~/scratch/]
[4]: results.results.keys()
[4]: ['yhat', 'resid', 'F']

[~/scratch/]
[5]: results.F
[5]: array([ 69.74846262])

Skipper

josef...@gmail.com

May 2, 2012, 2:35:10 PM
to pystat...@googlegroups.com
I'm still trying to figure out some of the basic commands. I was using
your .do files as a pattern, but haven't started with your export
script yet.

Two things:

How do I get confidence intervals, or all the results from the
parameter table, out? Is there an easy way to get the content of e()
or r(), all of it, in a loop? Copy-paste doesn't create valid Python.

BTW: I have some additional export functions for R; one I found easy
to use exports as JSON.
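For what it's worth, the JSON route is trivial on the Python side; a sketch with a made-up file layout (the numbers are the Vuong statistic from the zip run below, but the key names here are hypothetical):

```python
import json

# Stand-in for json.load(open("htest_results.json")); keys are illustrative.
payload = '{"statistic": 3.550164262077308, "p_value": 0.0002, "method": "Vuong test"}'
res = json.loads(payload)

# Everything arrives as plain Python scalars/strings, ready for asserts.
stat = float(res["statistic"])
print(res["method"], stat)
```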


. zip count child camper, inflate(_cons) vuong

Fitting constant-only model:

Iteration 0: log likelihood = -1347.807
Iteration 1: log likelihood = -1316.0866
Iteration 2: log likelihood = -1127.8174
Iteration 3: log likelihood = -1127.023
Iteration 4: log likelihood = -1127.0229

Fitting full model:

Iteration 0: log likelihood = -1127.0229
Iteration 1: log likelihood = -1047.8564
Iteration 2: log likelihood = -1038.9185
Iteration 3: log likelihood = -1038.9071
Iteration 4: log likelihood = -1038.9071

Zero-inflated Poisson regression Number of obs = 250
Nonzero obs = 108
Zero obs = 142

Inflation model = logit LR chi2(2) = 176.23
Log likelihood = -1038.907 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
count | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
count |
child | -.9277419 .0958062 -9.68 0.000 -1.115519 -.7399652
camper | .80334 .0929726 8.64 0.000 .621117 .9855629
_cons | 1.616819 .0848482 19.06 0.000 1.450519 1.783118
-------------+----------------------------------------------------------------
inflate |
_cons | .0603225 .1429108 0.42 0.673 -.2197776 .3404225
------------------------------------------------------------------------------
Vuong test of zip vs. standard Poisson: z = 3.55 Pr>z = 0.0002

. ereturn
invalid syntax
r(198);

. ereturn list,

scalars:
e(rank) = 4
e(N) = 250
e(ic) = 4
e(k) = 4
e(k_eq) = 2
e(k_dv) = 1
e(converged) = 1
e(rc) = 0
e(ll) = -1038.907092905589
e(k_eq_model) = 1
e(ll_0) = -1127.022936563995
e(chi2) = 176.2316873168124
e(p) = 5.74382355453e-38
e(vuong) = 3.550164262077308
e(N_zero) = 142
e(df_m) = 2
e(df_c) = 1

macros:
e(cmdline) : "zip count child camper, inflate(_cons) vuong"
e(cmd) : "zip"
e(predict) : "zip_p"
e(inflate) : "logit"
e(chi2type) : "LR"
e(opt) : "moptimize"
e(vce) : "oim"
e(title) : "Zero-inflated Poisson regression"
e(user) : "zip_llf"
e(ml_method) : "e2"
e(technique) : "nr"
e(which) : "max"
e(depvar) : "count"
e(properties) : "b V"

matrices:
e(b) : 1 x 4
e(V) : 4 x 4
e(ilog) : 1 x 20
e(gradient) : 1 x 4

functions:
e(sample)

. matrix list e(b)

e(b)[1,4]
count: count: count: inflate:
child camper _cons _cons
y1 -.92774194 .80333999 1.6168186 .06032245

. matrix list e(V)

symmetric e(V)[4,4]
count: count: count: inflate:
child camper _cons _cons
count:child .00917883
count:camper -.00046072 .0086439
count:_cons -.00089341 -.00705893 .00719922
inflate:_cons .00329854 -.0011287 .00078925 .02042351

Skipper Seabold

May 2, 2012, 2:55:16 PM
to pystat...@googlegroups.com
You don't want to loop through matrices, etc. unless you're using
mata, which you probably won't want to do too much.

You can use estout though to get regression tables and store the results.

ssc install estout

webuse auto
reg mpg weight foreign
est sto last_reg
esttab last_reg, ci
ret list
mat coefs = r(coefs)
mat2nparray coefs, saving("/home/skipper/scratch/coefs.py") format("%16.0g")

>
> BTW: I have some additional export functions for R, one I found easy
> to use is to export as json.
>

You also know about this, right? It creates importable Python from R:

https://github.com/statsmodels/statsmodels/tree/master/tools/R2nparray

josef...@gmail.com

May 2, 2012, 3:08:40 PM
to pystat...@googlegroups.com
I will go through it, it still sounds like a foreign language to me.

>
>>
>> BTW: I have some additional export functions for R, one I found easy
>> to use is to export as json.
>>
>
> You also know about this right, which creates importable python from R?
>
> https://github.com/statsmodels/statsmodels/tree/master/tools/R2nparray

Yes, but the tests I was writing were for statistical tests, so I
wrote variations on your functions that automatically save the htest
class. I didn't have many matrices, mainly scalars. Using json was
also nice because it saves everything with one command instead of
having to list each matrix or scalar result.

Thanks,

Josef

Skipper Seabold

May 2, 2012, 3:12:37 PM
to pystat...@googlegroups.com
Well, it is. help is obviously helpful. Always click through to the
PDF help if you want more detail, since it has a lot more information
(including discussions of the stats/estimators). A lot of the commands
are abbreviations, if you haven't seen that yet. I.e., r == ret ==
return and e == est == estimate.

Another useful command when you don't find things by using help is findit.

Let me know how it's going.

josef...@gmail.com

May 2, 2012, 10:36:41 PM
to pystat...@googlegroups.com
Stata looks like fun (as long as I don't have to figure out how to
program in it).

import statsmodels.miscmodels.count as mcount

data = np.load(r"E:\Josef\eclipsegworkspace\statsmodels-git\local_scripts\local_scripts\fish.npy")
endog = data["count"]
exog = np.column_stack((data["child"], data["camper"], np.ones(len(endog))))

mod = mcount.PoissonZiGMLE(endog, exog)
res = mod.fit()

versus
zip count child camper, inflate(_cons) vuong

method='nm' doesn't match Stata's bse, but method='bfgs' does; no idea
why. We might have to think again about switching optimizers in the
middle, like Stata does.
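Switching optimizers mid-way can be done by hand: run a derivative-free method first and restart a gradient-based one from its solution. A generic sketch with scipy on a stand-in objective (Rosenbrock, not the ZIP likelihood):

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    # Stand-in objective with minimum at (1, 1); not the ZIP likelihood.
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

# Stage 1: Nelder-Mead, robust but derivative-free.
stage1 = minimize(objective, x0=[-1.2, 1.0], method="Nelder-Mead")
# Stage 2: restart BFGS from the Nelder-Mead solution; BFGS also builds
# the inverse-Hessian approximation that bse calculations lean on.
stage2 = minimize(objective, x0=stage1.x, method="BFGS")
print(stage2.x)
```

With a GenericLikelihoodModel, the same pattern would presumably be a second fit() call passing start_params from the first fit and method='bfgs'.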

Optimization terminated successfully.
Current function value: 1038.907091
Iterations: 250
Function evaluations: 423
params
sm [-0.92771311 0.80331725 1.61683668 -0.06033268]
st [-0.92774194 0.80333999 1.61681858 0.06032245]
bse
sm [ 0.1017595 0.09384142 0.08533824 0.26909025]
st [ 0.09580624 0.0929726 0.08484823 0.14291083]
llf
sm -1038.90709093
st -1038.90709291

continue with bfgs

Warning: Desired error not necessarily achieved due to precision loss
Current function value: 1038.907091
Iterations: 3
Function evaluations: 101
Gradient evaluations: 78
params
sm [-0.92774926 0.8033496 1.6168081 -0.06035464]
st [-0.92774194 0.80333999 1.61681858 0.06032245]
bse
sm [ 0.09572238 0.09296411 0.08484285 0.14014844]
st [ 0.09580624 0.0929726 0.08484823 0.14291083]
llf
sm -1038.9070909
st -1038.90709291

Josef

josef...@gmail.com

May 27, 2012, 10:21:57 AM
to pystat...@googlegroups.com
I learned a bit more about Stata programming (getting familiar with
lots of quotes).

I added a few more options to mat2nparray, and I renamed it because it
has different defaults. It now writes e() into a dict and adds it to
the bunch:

https://github.com/josef-pkt/statsmodels/commit/2a837a2fec26c0af1eb224972949c102fc19a615

My adjustments are for Stata 12, so I don't know if everything works
the same for Stata 11.

Also, I found

matrix params_table = r(table)'

which gets params, bse, t/z-values, ... for the last estimate.

An example file created for Cox regression is below. I added the do
file as an example to tools (it also dumps a bunch of csv files).

I also have a script that extracts the data from a Stata plot into a
csv file, but that doesn't have a clean copyright, since I adjusted a
downloaded ado file.

Josef

-------------------
import numpy as np

est = dict(
    risk = 31938.09999990463,
    N_fail = 75,
    N_sub = 103,
    df_m = 2,
    rank = 2,
    converged = 1,
    r2_p = .0091231867926981,
    chi2 = 5.443169407354162,
    ll = -295.5935507329181,
    ll_0 = -298.3151354365951,
    N = 172,
    cmdline = "stcox ",
    datasignaturevars = "_t _t0 _d age posttran",
    datasignature = "172:5:3138366353:569466897",
    cmd2 = "stcox",
    marginsnotok = "CSNell DEViance DFBeta ESR LDisplace LMax MGale SCAledsch SCHoenfeld SCores",
    marginsprop = "addcons",
    k_eform = "1",
    footnote = "stcox_footnote",
    predict = "stcox_p",
    estat_cmd = "stcox_estat",
    vce = "oim",
    chi2type = "LR",
    method = "breslow",
    ties = "breslow",
    t0 = "_t0",
    depvar = "_t",
    cmd = "cox",
    crittype = "log likelihood",
    properties = "b V",
    )

params_table = np.array([ 1.0323969785881,
.01505912144071,
2.1857970773551,
.02883045469589,
1.0032995597729,
1.0623382727677,
np.nan,
1.9599639845401,
1,
.96759949129124,
.29895753046714,
-.10660326750099,
.91510372626864,
.52808200941855,
1.7729230665857,
np.nan,
1.9599639845401,
1]).reshape(2,9)

params_table_colnames = 'b se z pvalue ll ul df crit eform'.split()

params_table_rownames = 'age posttran'.split()

cov = np.array([ .00021276776001,
-.00082941216974,
-.00082941216974,
.09546138521867]).reshape(2,2)

cov_colnames = 'age posttran'.split()

cov_rownames = 'age posttran'.split()

class Bunch(dict):
    def __init__(self, **kw):
        dict.__init__(self, kw)
        self.__dict__ = self


results = Bunch(
    params_table=params_table,
    params_table_colnames=params_table_colnames,
    params_table_rownames=params_table_rownames,
    cov=cov,
    cov_colnames=cov_colnames,
    cov_rownames=cov_rownames,
    **est
    )

-------------------------
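As a usage sketch, a test can pull columns out of such a Bunch by name instead of by position (the numbers below are the b and se entries copied from the params_table above; everything else is illustrative):

```python
import numpy as np

class Bunch(dict):
    # Same pattern as in the generated file: a dict with attribute access.
    def __init__(self, **kw):
        dict.__init__(self, kw)
        self.__dict__ = self

results = Bunch(
    params_table=np.array([[1.0323969785881, .01505912144071],
                           [.96759949129124, .29895753046714]]),
    params_table_colnames=['b', 'se'],
    params_table_rownames=['age', 'posttran'],
)

# Select the coefficient column by name; robust if the layout changes.
b = results.params_table[:, results.params_table_colnames.index('b')]
print(b)
```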