binominal regression: se=T covers full data range

164 views
Skip to first unread message

Ingo Miller

unread,
Jul 3, 2021, 11:35:07 AM7/3/21
to ggplot2
Hi all,

I have some dose/response data for which I want to plot a binominal regression with errors. When setting se=T in the geom_smooth function, the error covers the whole data range: 


binominal_error-problem.png

So, here's my data and the ggplot code:

df <- data.frame(x = c(83.3, 41.7, 20.8, 10.4, 5.2, 2.6, 1.3),
                 y = c(15, 15, 2, 0, 0, 0, 0),
                 total = c(15, 15, 15, 15, 15, 15, 15))

ggplot(data = df,
       aes(x = x, y = y/total)) + 
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = binomial(link = "logit")), aes(weight=total),  se=T, fullrange=F)

I'm 
quite new to this and I can't figure out what the problem is. It works fine for my other data sets. The only difference is that for my other data the points don;t lie perfectly on the fitted line as they do in this plot. So the R^2 in this example would be 1. Could this cause problems? 

I would appreciate some help here.

Cheers,
Ingo

eipi10

unread,
Jul 3, 2021, 2:48:04 PM7/3/21
to ggplot2
I think there are two issues here: First, logistic regression is appropriate when the outcome can be one of two categories which get dummy-coded as 0 or 1. In your case, the outcome is a proportion that can take on any real value between 0 and 1. I think beta regression is the appropriate method for such data (see, for example, the vignette for the betareg package: https://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf). 

Second, in cases where logistic regression is appropriate, a problem that can arise is called "perfect separation." This occurs when there is at least on independent variable that allows you to predict with 100% percent certainty whether the outcome is 0 or 1. In such cases, there is not a unique solution for the regression coefficients, and the standard error for the regression line essentially covers the full range between probabilities of 0 and 1. For example, if you run the code below you'll get a warnings about the fit not converging and that "fitted probabilities numerically 0 or 1 occurred," meaning there was perfect separation. Also, in the summary, note the huge standard errors on the coefficients.

df <- data.frame(x = c(83.3, 41.7, 20.8, 10.4, 5.2, 2.6, 1.3),
                 y = c(15, 15, 2, 0, 0, 0, 0),
                 total = c(15, 15, 15, 15, 15, 15, 15))
df$yt = df$y/df$total

m = glm(yt ~ x, weights=total, data=df, family=binomial)
summary(m)

The examples below use data appropriate for logistic regression and demonstrate cases where there is perfect separation and where there is overlap between the two possible outcomes:

# Perfect separation
set.seed(2)
df2 = data.frame(
  x = c(runif(20,0,9), runif(20,11,20)),
  y = rep(0:1, each=20)
)

ggplot(data = df2,
       aes(x,y)) + 
  geom_point() +
  geom_smooth(method = "glm", 
              method.args = list(family = binomial(link = "logit")))

# Outcomes overlap
set.seed(2)
df3 = data.frame(
  x = c(runif(20,0,12), runif(20,8,20)),
  y = rep(0:1, each=20)
)

ggplot(data = df3,
       aes(x,y)) + 
  geom_point() +
  geom_smooth(method = "glm", 
              method.args = list(family = binomial(link = "logit")))

Roman Luštrik

unread,
Jul 5, 2021, 2:28:52 AM7/5/21
to Ingo Miller, ggplot2
The large errors you see on the plot make sense.

> mdl <- glm(I(y/total) ~ x, family = "binomial", data = xy)
Warning messages:
1: In eval(family$initialize) : non-integer #successes in a binomial glm!
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(mdl)

Call:
glm(formula = I(y/total) ~ x, family = "binomial", data = xy)

Deviance Residuals:
         1           2           3           4           5           6  
 2.110e-08   2.110e-08   8.600e-09  -2.209e-05  -2.110e-08  -2.110e-08  
         7  
-2.110e-08  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -42.396  77666.343  -0.001        1
x               1.948   3733.959   0.001        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7.8225e+00  on 6  degrees of freedom
Residual deviance: 4.8790e-10  on 5  degrees of freedom
AIC: 4.2862

Number of Fisher Scoring iterations: 24



I wonder what happens if you do (basically from the help file of ?predict.glm):

> mdl <- glm(cbind(xy$y, xy$total) ~ x, family = "binomial", data = xy)
> summary(mdl)

Call:
glm(formula = cbind(xy$y, xy$total) ~ x, family = "binomial",
    data = xy)

Deviance Residuals:
       1         2         3         4         5         6         7  
-1.34652   3.08112  -0.04192  -1.61731  -1.46472  -1.39354  -1.35917  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.808032   0.433658  -6.475 9.47e-11 ***
x            0.039642   0.007582   5.228 1.71e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 53.443  on 6  degrees of freedom
Residual deviance: 19.859  on 5  degrees of freedom
AIC: 34.087

Number of Fisher Scoring iterations: 4

Cheers,
Roman



--
--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: https://github.com/hadley/devtools/wiki/Reproducibility
 
To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

---
You received this message because you are subscribed to the Google Groups "ggplot2" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ggplot2+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ggplot2/1085ba50-038a-42e3-9a02-f95d4c7b83c8n%40googlegroups.com.


--
In God we trust, all others bring data.
Reply all
Reply to author
Forward
0 new messages