Hi,
sorry for the delay in posting, I got the notification about pending
messages just a short time ago.
On Wed, Mar 15, 2017 at 12:14 PM, Parin Sripakdeevong <pa...@jimmy.harvard.edu> wrote:
> Hi All,
>
> I am a research scientist at Dana-Farber Cancer Institute.
>
> I have been using statsmodels to fit logistic regression models
> (specifically the functions sm.Logit() and sm.GLM(..., family=sm.families.Binomial())).
>
> I am working with data where the number of positive outcomes in certain
> explanatory groups is very low (e.g. ~10 positives vs. ~10,000
> negatives).
>
> In this situation, maximum likelihood estimation in the conventional
> logistic model is known to suffer from small-sample bias. For discussion, see
>
> http://statisticalhorizons.com/logistic-regression-for-rare-events.
>
> I searched online and it seems that the solution is to use 'Firth Logistic
> Regression'. This model appears to be supported in both SAS and R. For
> example, see
>
> https://www.r-bloggers.com/example-8-15-firth-logistic-regression/.
>
> However, it appears that Firth Logistic Regression is currently not
> supported by statsmodels. Is this correct?
Correct, it's not supported.
We have some related open issues:
https://github.com/statsmodels/statsmodels/issues/2293
https://github.com/statsmodels/statsmodels/issues/2282
>
> If so, can anyone give me any tips/instructions on how I could extend/modify
> the existing functionality of pystatsmodel to support Firth Logistic
> Regression?
I only knew about Firth as a way of handling perfect separation cases,
and didn't know about the bias correction part. I also never looked at
what kind of penalization Firth logit actually uses.
We now have elastic net (L1/L2) penalization for GLM, and more structured
L2 penalization is waiting in a PR. However, based on a quick look, Firth
seems to use the log determinant of the information matrix (the Jeffreys
prior) as the penalization term, which is not covered by any of the
existing or work-in-progress penalization terms.
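For reference, that objective can be written down directly: the usual binomial
log-likelihood plus half the log-determinant of the Fisher information X'WX.
A minimal numpy sketch (the function name and code below are made up for
illustration only, not statsmodels API):

```python
import numpy as np

def firth_penalized_loglike(beta, X, y):
    # Hypothetical helper, not part of statsmodels.
    # Binary-logit log-likelihood plus half the log-determinant of the
    # Fisher information X'WX, i.e. the Jeffreys prior penalty.
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    loglike = np.sum(y * eta - np.log1p(np.exp(eta)))
    W = p * (1.0 - p)                      # logit IRLS weights
    fisher_info = X.T @ (W[:, None] * X)   # X' W X
    _sign, logdet = np.linalg.slogdet(fisher_info)
    return loglike + 0.5 * logdet
```

This could then be handed to a generic optimizer, although the log-det term
makes the gradient depend on the hat matrix, which is why it doesn't fit the
existing penalization framework.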
Based on my brief skimming, there might be a way to implement this by
data augmentation; otherwise we would need either a special-case IRLS or
a generic penalization method.
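The special-case IRLS route would amount to Newton steps on Firth's modified
score, where the residuals are adjusted by the hat-matrix diagonals h_i,
i.e. U*_j = sum_i (y_i - p_i + h_i (0.5 - p_i)) x_ij. A rough sketch under
that assumption (everything here is hypothetical, not existing statsmodels
code, and there is no step-size control):

```python
import numpy as np

def firth_logit_fit(X, y, maxiter=100, tol=1e-8):
    # Hypothetical fitter, not statsmodels API: Newton iterations on
    # Firth's modified score for a binary logit model.
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(maxiter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)
        XtWX = X.T @ (W[:, None] * X)
        XtWX_inv = np.linalg.inv(XtWX)
        # hat diagonals of W^{1/2} X (X'WX)^{-1} X' W^{1/2}
        h = np.einsum('ij,jk,ik->i', X, XtWX_inv, X) * W
        # modified score: residuals shifted by h_i * (0.5 - p_i)
        score = X.T @ (y - p + h * (0.5 - p))
        step = XtWX_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

One appealing property of this adjustment is that it yields finite estimates
even under perfect separation, where plain MLE diverges.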
So for now I don't know what the best way of implementing this is.
Josef