formula.api.poisson: Translate R's glm() with poisson and reference to statsmodels

12 views
Skip to first unread message

c.b...@posteo.jp

unread,
Jun 3, 2024, 8:20:10 AMJun 3
to pystat...@googlegroups.com
Hello,

I try to "translate" R-Code into statsmodels and I am stuck near the
goal.
Init the past I was using ols() and glm() functions of statsmodels. Now
I need a Poission regression.

In R it looks like this:

Reg <- glm(VAR ~ AGE+FOO+BAR, data=df, family=poisson(),
offset(DAYS))

I am assuming because of the "family=poission()" that statsmodels glm()
is not the choice but formula.api.poisson() is. Am I right so far?
But what I am missing in the docu is how I do set the "offset".

My current state of a solution looks like this:

import statsmodels.formula.api as smapi
formula = 'VAR ~ AGE+FOO+BAR'
result = smapi.poisson(formula=formula, data=df)

But what is about "statsmodels.discrete.discrete_model.Poisson()"? It is
a class. There is an "offset" argument. But I don't now how to use a
class to execute a model.

Thanks in advance
Christian Buhtz

josef...@gmail.com

unread,
Jun 3, 2024, 8:24:37 AMJun 3
to pystat...@googlegroups.com
offset should work in the same way in discrete and GLM
(I don't find an example right now)

`smapi.poisson(formula=formula, data=df, offset=my_offset)`
where my_offset is a 1-dim array, or pandas Series

GLM with family Poisson estimates the same model
but GLM and discrete differ in some of the extras that are available.

Did you get any error messages?

Josef


--
You received this message because you are subscribed to the Google Groups "pystatsmodels" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pystatsmodel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pystatsmodels/9f5487508a469ecac8f090e0c6878069%40posteo.de.

c.b...@posteo.jp

unread,
Jun 4, 2024, 4:05:03 AMJun 4
to pystat...@googlegroups.com
Hello Josef,

Thank you for the reply.

Am 03.06.2024 14:24 schrieb josef...@gmail.com:
`smapi.poisson(formula=formula, data=df, offset=my_offset)`
where my_offset is a 1-dim array, or pandas Series

The docu has room for improvement because "offset" is not listed there
as an argument and the possible values for "kwargs" are not described or
linked.

Did you get any error messages?

Yes.

x = reg.fit()
Warning: Maximum number of iterations has been exceeded.
Current function value:
4458231981365606423606362108676491528138936007099102773782264681006035469423790276026214755322372668336710045212855954776080397453649510400.000000
Iterations: 35
C:\python\Lib\site-packages\statsmodels\base\model.py:607:
ConvergenceWarning: Maximum Likelihood optimization failed to converge.
Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "

I don't understand that message. Maybe some columns in the original data
frame has a wrong dtype?

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 775973 entries, 4 to 4358160
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PERSON 775973 non-null object
1 RETTUNGEN_N 775973 non-null int64
2 Alter 775973 non-null int32
3 Altersgruppe 775973 non-null category
4 Gender 775973 non-null category
5 VTAGE 775973 non-null int32
6 Pflegeform 775973 non-null int64
7 PFLEGEBEDUERFTIG 775973 non-null category
8 Vielfahrer 775973 non-null int64
dtypes: category(3), int32(2), int64(3), object(1)
memory usage: 37.7+ MB

Best,
Christian

josef...@gmail.com

unread,
Jun 4, 2024, 11:00:10 AMJun 4
to pystat...@googlegroups.com
The  "Current function value" format looks strange, large number of digits instead of scientific notation like 1e+50.

Check that the model matrix `model.exog` is float64 to have the proper dtype.
I'm not sure whether `model.endog` needs to be float or can be int, I guess int is fine.
Also, a useful check for numerical problems is to look at matrix rank and condition number of exog to see whether there is high multicollinearity or near singularity.

You could try other optimizers to see if it is a convergence problem with the default optimizer.
For example, use `method="nm"` which is usually pretty robust to convergence problems, but slow and not very precise at optimum.
"nm" needs a large maxiter.

Also irls in GLM is often a good optimization algorithm even for more difficult problems.

If you have an approximate solution, then method="newton" is usually the most precise in the neighborhood of the solution.
(it usually brings the gradient, score closer to zero.)

Otherwise, it's difficult to tell what the problem might be without seeing the data.

Poisson and other models with exp in the link function can run into scaling problems (overflow if exog values are large).
But this usually results in numerical warnings like zero-division which don't seem to be a problem in your case.

Josef



--
You received this message because you are subscribed to the Google Groups "pystatsmodels" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pystatsmodel...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages