source code for logistic regression


Jason Dou

Nov 3, 2021, 8:36:17 PM
to pystatsmodels
I got a nan p-value and want to figure out the reason. Is there a way to look at the source code of the logistic regression model?

josef...@gmail.com

Nov 3, 2021, 8:56:37 PM
to pystatsmodels
On Wed, Nov 3, 2021 at 8:36 PM Jason Dou <dou...@gmail.com> wrote:
I got a nan p-value and want to figure out the reason. Is there a way to look at the source code of the logistic regression model?

It's not so easy, because most of the code of the Logit model is inherited.
statsmodels.discrete.discrete_model.Logit itself defines loglike, score and hessian.
Optimization and inference such as p-values are inherited from base.model.LikelihoodModel and LikelihoodModelResults.
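If you just want to read the relevant code, a rough sketch like the following points at the right places (it only uses inspect from the standard library):

    import inspect
    import statsmodels.api as sm
    from statsmodels.base.model import LikelihoodModelResults

    # the model-specific pieces (loglike, score, hessian) are defined on Logit itself
    print(inspect.getsource(sm.Logit.loglike))

    # the shared optimization/inference machinery (bse, pvalues, cov_params)
    # lives in the base classes; this prints the file to read
    print(inspect.getsourcefile(LikelihoodModelResults))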

If you get a nan standard error (bse) and p-value, then most likely you have a Hessian, and therefore cov_params, that is not positive definite.
Either the design matrix exog is (near) singular or there are convergence failures.
Model-specific problems like perfect separation also cause problems for maximum likelihood estimation. We warn for cases that can be easily identified, but there can be problems with the data/model that are more difficult to identify.

If some values of exog are large, then the exp in the logit transform might overflow, but that would most likely raise an exception or produce nan everywhere.

There is a lot of code involved in this. It's usually better to directly verify that your data, i.e. the design matrix, is well behaved (no singular values/eigenvalues close to zero),
and then work backwards from bse and cov_params to check why they contain a nan.
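As a rough sketch of those checks (the data here is made up just to show the calls; replace exog/endog with your own arrays):

    import numpy as np
    import statsmodels.api as sm

    # made-up data, only to illustrate the diagnostic steps
    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 2))
    exog = sm.add_constant(x)
    endog = (x[:, 0] + rng.normal(size=100) > 0).astype(float)

    res = sm.Logit(endog, exog).fit()

    # 1) is the design matrix well behaved?
    sv = np.linalg.svd(exog, compute_uv=False)
    print("smallest singular value:", sv[-1])
    print("condition number:", sv[0] / sv[-1])

    # 2) work backwards from the nan: pvalues <- bse <- cov_params <- Hessian
    print(res.pvalues)
    print(res.bse)                    # nan bse usually points to a bad covariance
    print(np.diag(res.cov_params()))  # negative or nan entries are the problem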

Josef

 


Jason Dou

Nov 8, 2021, 10:22:42 AM
to pystatsmodels
Thanks for the detailed response!! Two sets of data give me the nan p-value problem. One dataset is a matrix that is all zeros except for a single 0.01. I guess it makes sense that the p-value is nan for that one (maybe it's very close to 1?), but I don't know why the log-likelihood is ~-27. Is there any randomness involved in the implementation of the logistic regression model here?

josef...@gmail.com

Nov 8, 2021, 11:40:02 AM
to pystatsmodels
On Mon, Nov 8, 2021 at 10:22 AM Jason Dou <dou...@gmail.com> wrote:
Thanks for the detailed response!! Two sets of data give me the nan p-value problem. One dataset is a matrix that is all zeros except for a single 0.01. I guess it makes sense that the p-value is nan for that one (maybe it's very close to 1?), but I don't know why the log-likelihood is ~-27. Is there any randomness involved in the implementation of the logistic regression model here?

The algorithm itself is deterministic; we don't add any random noise.

However, there are two main reasons why results can differ across runs and machines:

- convergence tolerance: by default, the optimization is only accurate to a precision of about 1e-5 to 1e-8. Different versions of the optimizer can produce results that differ within that range.

- floating point noise: all computations are only "deterministic" up to floating point error.
  In regular, well-posed cases this might only affect the last digits of the result, e.g. our unit tests work in those cases with something like rtol=1e-13.
  In ill-posed problems the floating point noise can dominate. For example, if the smallest singular value is 1e-14, then some results are often only or mostly floating point noise, and the precision might not even be one digit.
  A related problem, which I guess you have, is that if a value is theoretically zero or very small, then even small floating point noise can make it negative, which causes problems if that creates a "negative variance".

The dataset you mention sounds ill posed. The information in the data is too small to avoid floating point noise around zero.
In your case, the log-likelihood might look reasonable because it uses computations that are less numerically fragile than, for example, the computation of the standard errors bse.

For example:
a = 1 + eps1
b = 1 + eps2
where eps1 and eps2 are floating point noise.

a and b are around 1 with high precision.
a - b and (a - 1) / (b - 1) are just noise.
a - b is noise around zero, which still has good absolute precision but no relative precision.
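A small numeric sketch of the same point (the eps values are made up, roughly the size of double-precision rounding error):

    eps1, eps2 = 3e-16, -2e-16   # made-up values at the scale of double-precision noise
    a = 1 + eps1
    b = 1 + eps2

    print(a, b)               # both equal 1 to about 16 digits: high relative precision
    print(a - b)              # ~4e-16: noise around zero, good absolute precision only
    print((a - 1) / (b - 1))  # the result depends entirely on the noise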

Josef



 


josef...@gmail.com

Nov 8, 2021, 12:49:54 PM
to pystatsmodels
There are examples where some coefficients are not identified and their values depend on the algorithm and the noise, but some summary statistics are still identified and estimated with good precision.

An example is multicollinearity: add the same variable twice to the regressor matrix (exog).

In that case, the individual coefficients are not identified; however, the sum of the two effects is the same as if the variable were added only once.

So the sum `a1 x + a2 x` has high precision; residuals, loglike, rsquared and in-sample prediction are not affected by the multicollinearity.
However, we can only reliably estimate a1 + a2; how the sum is split up is arbitrary.
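A small sketch of that multicollinearity example, using OLS for simplicity (made-up data; the variable x enters the regressor matrix twice):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=100)

    # the same variable appears twice, so the two slopes are not separately identified
    exog = sm.add_constant(np.column_stack([x, x]))
    res = sm.OLS(y, exog).fit()

    print(res.params)                     # the split between the two slopes is arbitrary
    print(res.params[1] + res.params[2])  # but their sum is close to 2.0
    print(res.rsquared)                   # fit statistics are unaffected by the duplication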

Josef