Logistic results


Abhaya Indrayan

Jul 19, 2025, 11:50:28 AM
to MedS...@googlegroups.com
I am analysing a dataset on more than 100,000 adults regarding the prevalence of a disease and its possible predictors. The prevalence is nearly 25%. The logistic regression (forward selection) with 15 predictors finds 9 significant (because of the extremely large n, I am using P<0.001 for statistical significance), and the percentage correctly classified at the last step is nearly 70% with both 0.50 and 0.75 as cutoffs.

A 70% correct classification is unacceptable after the logistic, since without logistic the correct classification is 75%.

What is it that I am doing wrong?

Thanks.

~~Abhaya Indrayan
--
Dr Abhaya Indrayan, MSc,MS,PhD(OhioState),FSMS,FAMS,FRSS,FASc
Personal website: http://indrayan.weebly.com

Jeremy Miles

Jul 19, 2025, 12:41:47 PM
to meds...@googlegroups.com
I think we need more information to answer.

If the prevalence is 25% and you say that without logistic regression you get 75% correct, then you get that by saying no one has the disease. (For any rare disease you can get quite good accuracy by saying that no one has it.)

You should think in terms of sensitivity and specificity instead of accuracy - what proportion of people who are positive are given a positive diagnosis, and what proportion of people who are negative are given a negative diagnosis.
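A small numeric sketch of that point, assuming 25% prevalence and simulated labels (not your data):

import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.25, size=100_000)    # 1 = has the disease, prevalence ~25%
pred = np.zeros_like(y)                    # classify everyone as disease-free

accuracy = (pred == y).mean()
sensitivity = pred[y == 1].mean()          # proportion of true cases called positive
specificity = (pred[y == 0] == 0).mean()   # proportion of non-cases called negative
print(accuracy, sensitivity, specificity)  # roughly 0.75, 0.0, 1.0

The all-negative rule matches the 75% baseline while detecting nobody, which is exactly what accuracy alone hides.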

I suspect many people would also suggest that stepwise regression is not the most appropriate approach.

Jeremy




--
--
To post a new thread to MedStats, send email to MedS...@googlegroups.com .
MedStats' home page is http://groups.google.com/group/MedStats .
Rules: http://groups.google.com/group/MedStats/web/medstats-rules

---
You received this message because you are subscribed to the Google Groups "MedStats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to medstats+u...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/medstats/CAP7G4a4DvzNyknVwRMuvO4S8CSCOLgv1ACeHnnS6ks678DFduQ%40mail.gmail.com.

Paul Thompson

Jul 19, 2025, 2:17:12 PM
to meds...@googlegroups.com
This is an increasingly serious problem. I think the answer is to tighten up the significance levels. Stepwise only exacerbates an already problematic issue.

Paul A. Thompson

Tzippy Shochat

Jul 19, 2025, 2:18:24 PM
to meds...@googlegroups.com
Hi. In addition to what Jeremy wrote, you may have a collinearity issue. Tzippy


On Sat, Jul 19, 2025 at 19:41, Jeremy Miles <jeremy...@gmail.com> wrote:

Diana Kornbrot

Jul 20, 2025, 6:16:11 AM
to meds...@googlegroups.com
The answer is NEVER NEVER NEVER give significance levels without effect sizes.
For logistic regression, giving sensitivity and specificity separately is generally a good way of doing this that can be interpreted easily by non-statisticians.

Diana Kornbrot

Jul 20, 2025, 6:17:08 AM
to meds...@googlegroups.com
Prior probability is also useful to set context.

Abhaya Indrayan

Jul 20, 2025, 12:12:08 PM
to meds...@googlegroups.com
What effect size do you mean in the case of coefficients in logistic regression? Significance (I used P<0.001 in view of a very large sample) is used only to decide whether to keep a variable or not - as simple as that. Sensitivity (Sn) and specificity (Sp) are one way to assess this; classification accuracy combines both. I already mentioned the prior probability of 0.25 as per the prevalence. The problem I am facing is that the logistic did not improve the classification accuracy despite some significant variables.

There is no serious collinearity. There is something else that I am missing.

Best.

~A. Indrayan

Bruce Weaver

Jul 20, 2025, 7:59:32 PM
to MedStats
Hello Abhaya.  Why would you not just include all 15 explanatory variables in the model?  You have roughly 25,000 events, so with all 15 variables in the model, you would have roughly 1667 events per variable, which is far beyond what the familiar rules of thumb recommend--e.g., see Frank Harrell's 20:1 rule of thumb here:
Regarding forward selection, see also:
Finally, here is the main link to Harrell's checklist, in case any other parts of it are of interest:
Cheers,
Bruce
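The events-per-variable arithmetic Bruce describes is a one-liner, using the figures from the thread (n = 100,000, prevalence about 25%, 15 candidate predictors):

n, prevalence, predictors = 100_000, 0.25, 15
events = n * prevalence        # about 25,000 events
epv = events / predictors      # about 1,667 events per variable
print(events, round(epv))      # comfortably above a 20:1 rule of thumb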

Abhaya Indrayan

Jul 20, 2025, 9:27:41 PM
to meds...@googlegroups.com
As far as effect size is concerned, I am shouting at the top of my voice to look for medical significance in addition to statistical significance. Please see:

Indrayan A. Attack on statistical significance: A balanced approach for medical research. Indian J Med Res 2020;151:275-8. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7371070/ DOI: 10.4103/ijmr.IJMR_980_19

Indrayan A. The conundrum of P-values: Statistical significance is unavoidable but need medical significance too. J Biostat Epidemiol 2020;5(4):226-34. https://jbe.tums.ac.ir/index.php/jbe/article/view/310 DOI: https://doi.org/10.18502/jbe.v5i4.386




Tzippy Shochat

Jul 20, 2025, 11:50:09 PM
to meds...@googlegroups.com
Another question - did you check for missing data? If the regression is not including the entire cohort, this would affect results. Tzippy

On Mon, Jul 21, 2025 at 4:27 AM Abhaya Indrayan <a.ind...@gmail.com> wrote:

Diana Kornbrot

Jul 21, 2025, 6:50:28 AM
to meds...@googlegroups.com
Effect sizes for logistic regression
I think odds ratios or marginal effects are easiest to explain to lay audiences:

Odds Ratio (OR): change in odds (OR = 2 → 2× higher odds)
Log-Odds (β): the raw model coefficient (β = 0.69 → OR ≈ 2)
Marginal Effect: change in probability (ME = 0.12 → a 12 percentage-point increase in probability)
Pseudo R²: overall model fit
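A minimal numeric sketch of the first three rows, using an assumed coefficient of β = 0.69 and an assumed baseline probability of 0.25 (the prevalence in this thread):

import numpy as np

beta = 0.69
odds_ratio = np.exp(beta)             # about 2: a doubling of the odds
p = 0.25                              # assumed baseline probability
marginal_effect = beta * p * (1 - p)  # about 0.13: approximate change in probability per unit change in x
print(round(odds_ratio, 2), round(marginal_effect, 2))

The marginal-effect approximation β·p·(1 − p) depends on the probability at which it is evaluated, which is why the 0.12 above and the 0.13 here differ slightly.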

Best

Diana


Rich Ulrich

Jul 21, 2025, 11:13:35 AM
to meds...@googlegroups.com
Sure, overall accuracy combines sensitivity and specificity, and the point about rare diseases is that it is less than useless for some diseases.  If "guesses" match the 3:1 distribution, the expected accuracy is 62.5%, so your 70% improves on THAT.

"Prior probability" is a program option in SPSS that does not do what you probably guess.  In general, DO NOT USE IT to match your sample.  In specific, you can give it several options and get SEVERAL sets of estimates of Sn and Sp — which you should do. 

What are your hypotheses?  What does the prior research say?  What is your curiosity?  What is interesting in these data?  If I had this large sample, knowing as little as I do, I would like to confirm homogeneity by age and sex (which usually are relevant).  So I would construct a criterion to use in multi-group discrimination, with maybe 20 groups, 2x2x5: +/- Dx; M/F; 5 age decades. Then I would look with interest at the plot of centroids on the first two principal components.

If there is no "interesting" collinearity, you may as well look at the univariate associations alone.  If 'univariate' has all the information, that is what you should present.  Starting with univariate examination should also remind you to consider how 'normal' the distributions are for your variables.  Since outliers can screw up a lot, I always considered what measures deserved or needed transformation.  In my little experience with samples of many thousands, I found that 'bad scaling' could introduce artifacts, like, calling for another variable to be entered in as a suppressor.  (Did you look for suppressors, with signs reversed from the univariate relation?)

One easy variation on size-of-effect that you might consider is a simple one that I've noted in the European literature — the sample size needed for the effect to be 'significant' at 0.05.  An F-test on a coefficient is basically R²·N.  If 1% of the observed F would still be significant, then the effect is significant with 1% of the df.  And so on.  An F-test of 500 denotes something better than the similarly "highly significant" F-test of 20.
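A rough sketch of that calculation, under the approximation that the test statistic for a coefficient scales with R²·N; the numbers below are made up for illustration:

from scipy import stats

n_total = 100_000
chi2_obs = 500                            # a hypothetical "highly significant" test statistic on 1 df
chi2_crit = stats.chi2.ppf(0.95, df=1)    # about 3.84, the 0.05 cutoff on 1 df
n_min = n_total * chi2_crit / chi2_obs    # shrink the sample until the statistic just clears the bar
print(round(n_min))                       # about 768: 'significant' with under 1% of the sample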

Rich Ulrich 

From: meds...@googlegroups.com <meds...@googlegroups.com> on behalf of Abhaya Indrayan <a.ind...@gmail.com>
Sent: Sunday, July 20, 2025 12:11 PM
To: meds...@googlegroups.com <meds...@googlegroups.com>
Subject: Re: {MEDSTATS} Logistic results
 

Abhaya Indrayan

Jul 21, 2025, 12:12:24 PM
to meds...@googlegroups.com
Thanks. All this is well known. What puzzles me is how this is related to what I asked.

~A. Indrayan

Marc Schwartz

Jul 21, 2025, 3:03:09 PM
to MedStats


On Jul 21, 2025, at 12:12 PM, Abhaya Indrayan <a.ind...@gmail.com> wrote:

Thanks. All this is well known. What puzzles me is how this is related to what I asked.

~A. Indrayan


Hi,

We are going to need to know more about your modeling process.

Bruce was right about the issues surrounding the use of stepwise selection, especially forward selection; he suggested just including all 15 covariates given the quantity of data that you have, and provided relevant links, primarily to Frank's resources.


How did you select the 15 initial variables? Was that the totality of the data that you have available, or did you conduct some initial data reduction from a larger data set, hopefully not using univariate pre-filtering?

Are there any relevant covariates, presumably based upon prior research/clinical subject matter expertise, that are not available in your data, thus cannot be included in the model?

Does your model include any interaction terms? 

Do you have any continuous variables and if so, did you consider non-linear transformations on them, such as the use of cubic regression splines?

Are there any missing data that may reduce your evaluable cohort from the 100,000 and bias the model?


There are numerous additional steps that you can consider if your primary concern is improving model fit (discrimination and calibration) relative to your current reduced model, while getting away from stepwise selection. Those can include the use of the LASSO for data reduction if that is warranted, non-linear transforms on continuous covariates, adding interaction terms, and what Frank calls "chunk tests", where you evaluate multiple model terms together, especially in the presence of multicollinearity. On the latter, there is a discussion from some time ago here:



Regards,

Marc Schwartz
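One possible shape for the workflow Marc describes (a sketch, not his or Harrell's exact recipe): fit a full logistic model with spline terms on continuous covariates and apply a likelihood-ratio "chunk test" to a group of related terms. The data here are synthetic stand-ins and the variable names (disease, age, sex, bmi) are placeholders.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic stand-in data; the real analysis would use the actual cohort.
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "age": rng.uniform(20, 80, n),
    "sex": rng.integers(0, 2, n),
    "bmi": rng.normal(27, 5, n),
})
logit_p = -2.5 + 0.02 * df["age"] + 0.3 * df["sex"] + 0.004 * (df["bmi"] - 27) ** 2
df["disease"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

full = smf.logit("disease ~ bs(age, df=4) + C(sex) + bs(bmi, df=4)", data=df).fit(disp=0)
reduced = smf.logit("disease ~ bs(age, df=4) + C(sex)", data=df).fit(disp=0)

# Chunk test: are all of the bmi spline terms, taken together, worth keeping?
lr = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model
p_value = stats.chi2.sf(lr, df_diff)
print(f"LR chi-square = {lr:.1f} on {df_diff:.0f} df, p = {p_value:.3g}")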

Michael Cooney

Jul 26, 2025, 6:48:24 PM
to MedStats
While I doubt this will make a huge difference, logistic regression is probably not what you want in order to model prevalence, especially when it's not a rare condition.  Maybe try log binomial, and see if it matters.
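A minimal sketch of that suggestion in Python/statsmodels, on synthetic data built around a true prevalence-ratio model; a commonly used fallback when the log-binomial fit does not converge is Poisson regression with robust standard errors:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=5_000)
p = np.clip(0.25 * np.exp(0.2 * x), 0.001, 0.95)   # true model on the prevalence-ratio scale
y = rng.binomial(1, p)
X = sm.add_constant(x)

# Older statsmodels versions spell the link class links.log() (lowercase).
fam = sm.families.Binomial(link=sm.families.links.Log())
res = sm.GLM(y, X, family=fam).fit()               # log-binomial; can fail to converge on harder data
print(np.exp(res.params))                          # prevalence ratios, roughly [0.25, 1.22]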

alejandro munoz

Aug 5, 2025, 4:43:48 PM
to meds...@googlegroups.com
Hi Abhaya, 

Sorry for the late feedback. I hope this is still pertinent.
- We know little about the context of your investigation. Is this to increase understanding and publish a paper? To deploy the model in practice?
- How did you get your data? Does it represent the spectrum of disease? Although this is not what you ask, I bring it up because the validity and applicability of whatever you generate may hinge on your study design / data collection.
- Related to the above: If your goal is solely predictive, as opposed to inferential, I'd worry less about significance levels.
- Did you split your data into train and test sets? I hope so, and that your stats on correct classification (preferably sensitivity, specificity, PPV, etc.) are based on the test set (a sketch follows this list).
- What are the misclassification costs given the intended use of the model? If this is for screening, depending on work up you'd want a threshold with high sensitivity.
- We also don't know how much feature engineering or pre-processing you did (splines, interactions, imputation, etc. ...), which can greatly impact your model. Again, if prediction is your goal, consider using other methods like LightGBM or XGBoost.
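To make the train/test point concrete, here is a sketch with scikit-learn on synthetic stand-in data (15 predictors, roughly 25% prevalence, as in the thread); the 0.25 threshold is only an example:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the cohort: 15 predictors, roughly 25% positives.
X, y = make_classification(n_samples=100_000, n_features=15, n_informative=9,
                           weights=[0.75], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

prob = model.predict_proba(X_test)[:, 1]
pred = (prob >= 0.25).astype(int)          # threshold chosen for the use case, not fixed at 0.5
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("sensitivity:", round(tp / (tp + fn), 3))
print("specificity:", round(tn / (tn + fp), 3))
print("PPV:        ", round(tp / (tp + fp), 3))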

Alejandro

On Sat, Jul 26, 2025 at 5:48 PM Michael Cooney <kastch...@gmail.com> wrote:
While I doubt this will make a huge difference, logistic regression is probably not what you want in order to model prevalence, especially when it's not a rare condition.  Maybe try log binomial, and see if it matters.


Abhaya Indrayan

Aug 8, 2025, 4:37:17 PM
to meds...@googlegroups.com
Thanks to all. Almost all the questions raised are well-known and unrelated to my original query. They are unnecessarily taking up the time of other group members.

My regards to everyone.

~A. Indrayan
--
Dr Abhaya Indrayan, MSc,MS,PhD(OhioState),FSMS,FAMS,FRSS,FASc
Personal website: http://indrayan.weebly.com
 

Rich Ulrich

Aug 8, 2025, 6:04:07 PM
to meds...@googlegroups.com
You write, 
Thanks to all. Almost all the questions raised are well-known and unrelated to my original query. They are unnecessarily taking up the time of other group members.

My regards to everyone.


~A. Indrayan


Your original query (I cut-and-paste, and indent) -
   A 70% correct classification is unacceptable after the logistic, since without logistic the correct classification is 75%.

   What is it that I am doing wrong?
So far as I read in your comments on the 10 replies I saw, you never admitted to learning anything. 

Early on, it was pointed out that "correct classification" is potentially a TERRIBLE criterion, e.g., for rare diseases. 

You did not explain what happens with cutoffs chosen for various sensitivities/specificities, which is what you SHOULD look at. (If there is nothing at all interesting THERE, then, oh, you happen to have uninteresting data.) 

You did not react to my criticism of 'prior probabilities' (a computer-program option — if that is what you meant — that is only a toy that is useful for immediately creating tables with a variety of cutoffs). 

You never mentioned the substance, the disease, what-have-you, that you are looking at.  That's not always needed, but the detail often opens new possibilities since we each travel down our own roads, and some roads are quite different. 

I read a reference you cited for your own paper carried by NIH, talking about effect size and p-levels. I don't know who-abuses-what in journals these days, but people who understand the data and distributions — and sampling — don't require rules of thumb being elevated to rules for publication.  Oh — I think physicists, astronomers and geneticists use some p-levels that are far more extreme than 0.001, which becomes useful when, in effect, millions of 'experiments' are being weighed at once. 

Given that you wrote about using effect sizes, I'm a bit surprised that you did not recognize or react to comments that explained how to get effect sizes from logistic regression.  Even if the prediction table is not useful, the effects of single variables may yet suggest some underlying role in the disease. 

I'm disappointed that you gave us basically zero appreciation, for our efforts that DID address the question you asked, "What am I doing wrong?"  You did not give us enough background for the bigger question, "Is there anything worth publishing in these data?"


Rich Ulrich 





Abhaya Indrayan

Aug 9, 2025, 12:45:22 AM
to meds...@googlegroups.com
Thanks. Others may find all these queries relevant, but, yes, I failed to understand their relevance to my question. I fully agree that I am not 'learning'. My question was not about effect size at all.

Nice to see that one of my papers caught Rich's attention. Many people disagree with published results, including my publications. For example, almost everybody uses the area under the ROC curve as an indicator of the PREDICTION accuracy of a model, which, in my opinion, is not appropriate (https://journals.lww.com/ijcm/fulltext/9900/assessing_the_adequacy_of_a_prediction_model.157.aspx). We can start a new thread if anybody is interested in discussing this.

I close this thread with thanks to everyone for their time and efforts.

~A. Indrayan
--
Dr Abhaya Indrayan, MSc,MS,PhD(OhioState),FSMS,FAMS,FRSS,FASc
Personal website: http://indrayan.weebly.com


Jeremy Miles

Aug 9, 2025, 9:33:17 PM
to meds...@googlegroups.com
I don't want to pile on here (too much) but I'm going to raise two issues. 

You say "Almost all the questions raised are well-known ...". Everything is well known to people who know it, we don't know what you know and when we tried to help you didn't clarify what you know (or wanted to know).

You also say "I close this thread with thanks to everyone for their time and efforts." I'm afraid that once you start a thread on a mailing list it's not owned by you, and you cannot choose to close it - anyone can continue to discuss it for as long as they please. (I'm on some lists where threads continue for weeks, and the original poster has checked out because the discussion has moved away from what they asked.)



Jeremy 



Bruce Weaver

Aug 15, 2025, 5:33:14 PM
to MedStats
Jeremy was channeling Robert Rankin there, methinks.  More than once, IIRC, he started sentences with, "Now, it's a fact well known to those who know it well that ...".  For example:  

“Now, it's a fact well known to those who know it well that prophets of doom only attain popularity when they get the drinks in all around.”
Robert Rankin, The Hollow Chocolate Bunnies of the Apocalypse

Abhaya Indrayan

Aug 16, 2025, 2:50:49 AM
to meds...@googlegroups.com
This thread refuses to die. Thanks to Jeremy and Bruce for keeping it alive. It is now clear that I cannot close the thread I started. The best I can do is not respond, but in this case, I must. 

All posts are welcome if they help to gain clarity (or even 'popularity').  There is no doubt that my knowledge is limited, and I would like to know how the posted queries addressed my original question. Those interested in my credentials may like to see the book Medical Biostatistics, Fourth Edition (the fifth edition is in preparation), published by CRC Press and touted by reviewers as 'the most comprehensive book on biostatistics' and 'encyclopedic in breadth', which induced me to write the Encyclopedia of Biostatistics for Medical Professionals, also published by CRC Press. Further details are at my website https://indrayan.weebly.com, or search Google.

My regards to everyone.

~Abhaya Indrayan

Rich Ulrich

Aug 19, 2025, 12:17:26 PM
to meds...@googlegroups.com
Now you say that you would like to know how the posts 'answered your original query.' 

Before, I addressed data problems.  Your query, with a question mark, was: what is it that you are doing wrong?  Okay, please forgive my bluntness in what follows, but I figure I am addressing 1,000 other readers too, some of whom know much less than you do.

Here are some points, for what you did wrong, judging from what you posted.  

Multivariable analysis with no rationale.  Look at univariate EFFECTs. What do you hope to learn from the regression? "How many of these warning signs do you have?" is an alternate approach, simpler to apply in practice. 

Stepwise procedure.  These are often disparaged even when there is a rationale for doing steps. My impression is that you might have started out with a lot more than 15 variables offered, and after 15 entries/steps forward, 9 of them still met that 0.001 criterion.  Instead of stepwise, consider doing multiple analyses.  Age and sex confound MOST disease prevalence rates: you have ample sample to explore by sex and age decade, if you don't have any particular hypotheses in mind and want to data-dredge.

Being upset at achieving 70% classification, compared to the chance-result of 62.5%.  Okay, that is not GREAT discrimination, but that's what you got.  Maybe you can put CIs on how small some effects must be, but do that from looking at the range of OR outcomes seen in (say) geographically-selected subsets.  

Not looking at odds ratios as measures of effect. In observational studies, an OR of 1.25 is barely suggestive and apt to be worthless, since ORs as large as 1.5 may be achieved by selection artifact, etc.  (The classic error at 1.5 was estrogen treatment for menopause, which looked good because the original survey sample, women taking estrogen, was loaded with women who lived longer because they paid a lot of attention to their health; that conclusion was reached retrospectively, after a large controlled study showed elevated mortality from estrogen.)  (Okay, I'm speaking from my reading, not from good personal experience.  I would probably believe a 1.25 OR if the authors convinced me that the problem was simple enough and the analysis was smart enough.)

Here is more explanation of the classification results.
The 3:1 ratio used for Predicted and Actual gives a random table that is (9,3; 3,1) in relative frequencies — which yields the 62.5% 'correct' that I cited: (9+1)/16 = 10/16 = 5/8 = 62.5%.

As it happens, moving '1' to those diagonal cells from the off-diagonals gives exactly 75% accuracy, with (10,2; 2,2). That table has an odds ratio of 5:1.  This table becomes significant with a total N of under 40, which defines a 'moderate effect size' in Cohen's description of effects in social science research. That's a huge effect for most medical predictions when seen for a single variable.  Smoking/cancer has OR effects at 5:1 to 9:1, depending on other confounds.  Smoking/heart disease is more like 2:1.  Second-hand smoke yielded ORs in that range of 1.25 to 1.50, which is questionable for observational studies and thus led to controversy and the need for a LOT of confirmation.  (Randomized-control studies can report smaller effects since they 'control' for the confounds.  Smaller than 1.25?  That is outside of what I remember reading.)
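To check the table arithmetic, a few lines reproduce the 62.5%, the 75%, and the 5:1 odds ratio:

chance = [[9, 3], [3, 1]]     # predicted x actual, random guessing at 3:1 against a 3:1 truth
better = [[10, 2], [2, 2]]    # one unit moved onto each diagonal cell

accuracy = lambda t: (t[0][0] + t[1][1]) / 16
odds_ratio = lambda t: (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])
print(accuracy(chance), accuracy(better))       # 0.625 and 0.75
print(odds_ratio(chance), odds_ratio(better))   # 1.0 and 5.0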

Your OR result at 70% (two different cutoffs) is a value like, say, 3.0  — My experience is NOT with huge surveys and data mining, so I do NOT know what to expect from sampling artifacts for your data, to be achieved at random as the result of 'best prediction' from 15 (or more) variables.  

Other data analysis notes: 
You should have hypotheses.  I'm sure your 15 (or whatever) potential predictors are NOT pre-judged as equally likely - sex and age, for example.  Consider an analysis of ONLY the ones suggested by experience and the literature. If you want to consider them together, think about interactions (if you haven't already).

And when you look at your predictors that are quantities, think about: "Do these scores represent equal intervals for the criterion, that is, in increasing the likelihood of the predicted prevalence?"  Weight, for example, might look like a good distribution (or its square root) without outliers, but morbidly obese and morbidly underweight are both unhealthy conditions.
