binominal regression: se=T covers full data range

Ingo Miller

unread,

Jul 3, 2021, 11:35:07 AM7/3/21

to ggplot2

Hi all,

I have some dose/response data for which I want to plot a binominal regression with errors. When setting se=T in the geom_smooth function, the error covers the whole data range:

So, here's my data and the ggplot code:

df <- data.frame(x = c(83.3, 41.7, 20.8, 10.4, 5.2, 2.6, 1.3),

y = c(15, 15, 2, 0, 0, 0, 0),

total = c(15, 15, 15, 15, 15, 15, 15))

ggplot(data = df,

aes(x = x, y = y/total)) +

geom_point() +

geom_smooth(method = "glm", method.args = list(family = binomial(link = "logit")), aes(weight=total), se=T, fullrange=F)

I'm

quite new to this and I can't figure out what the problem is. It works fine for my other data sets. The only difference is that for my other data the points don;t lie perfectly on the fitted line as they do in this plot. So the R^2 in this example would be 1. Could this cause problems?

I would appreciate some help here.

Cheers,

Ingo

eipi10

unread,

Jul 3, 2021, 2:48:04 PM7/3/21

to ggplot2

I think there are two issues here: First, logistic regression is appropriate when the outcome can be one of two categories which get dummy-coded as 0 or 1. In your case, the outcome is a proportion that can take on any real value between 0 and 1. I think beta regression is the appropriate method for such data (see, for example, the vignette for the betareg package: https://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf).

Second, in cases where logistic regression is appropriate, a problem that can arise is called "perfect separation." This occurs when there is at least on independent variable that allows you to predict with 100% percent certainty whether the outcome is 0 or 1. In such cases, there is not a unique solution for the regression coefficients, and the standard error for the regression line essentially covers the full range between probabilities of 0 and 1. For example, if you run the code below you'll get a warnings about the fit not converging and that "fitted probabilities numerically 0 or 1 occurred," meaning there was perfect separation. Also, in the summary, note the huge standard errors on the coefficients.

df <- data.frame(x = c(83.3, 41.7, 20.8, 10.4, 5.2, 2.6, 1.3),

y = c(15, 15, 2, 0, 0, 0, 0),

total = c(15, 15, 15, 15, 15, 15, 15))

df$yt = df$y/df$total

m = glm(yt ~ x, weights=total, data=df, family=binomial)

summary(m)

The examples below use data appropriate for logistic regression and demonstrate cases where there is perfect separation and where there is overlap between the two possible outcomes:

# Perfect separation

set.seed(2)

df2 = data.frame(

x = c(runif(20,0,9), runif(20,11,20)),

y = rep(0:1, each=20)

)

ggplot(data = df2,

aes(x,y)) +

geom_point() +

geom_smooth(method = "glm",

method.args = list(family = binomial(link = "logit")))

# Outcomes overlap

set.seed(2)

df3 = data.frame(

x = c(runif(20,0,12), runif(20,8,20)),

y = rep(0:1, each=20)

)

ggplot(data = df3,

aes(x,y)) +

geom_point() +

geom_smooth(method = "glm",

method.args = list(family = binomial(link = "logit")))

Roman Luštrik

unread,

Jul 5, 2021, 2:28:52 AM7/5/21

to Ingo Miller, ggplot2

The large errors you see on the plot make sense.

> mdl <- glm(I(y/total) ~ x, family = "binomial", data = xy)
Warning messages:
1: In eval(family$initialize) : non-integer #successes in a binomial glm!
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(mdl)

Call:
glm(formula = I(y/total) ~ x, family = "binomial", data = xy)

Deviance Residuals:
1 2 3 4 5 6
2.110e-08 2.110e-08 8.600e-09 -2.209e-05 -2.110e-08 -2.110e-08
7
-2.110e-08

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -42.396 77666.343 -0.001 1
x 1.948 3733.959 0.001 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 7.8225e+00 on 6 degrees of freedom
Residual deviance: 4.8790e-10 on 5 degrees of freedom
AIC: 4.2862

Number of Fisher Scoring iterations: 24

I wonder what happens if you do (basically from the help file of ?predict.glm):

> mdl <- glm(cbind(xy$y, xy$total) ~ x, family = "binomial", data = xy)

> summary(mdl)

Call:
glm(formula = cbind(xy$y, xy$total) ~ x, family = "binomial",
data = xy)

Deviance Residuals:
1 2 3 4 5 6 7
-1.34652 3.08112 -0.04192 -1.61731 -1.46472 -1.39354 -1.35917

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.808032 0.433658 -6.475 9.47e-11 ***
x 0.039642 0.007582 5.228 1.71e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 53.443 on 6 degrees of freedom
Residual deviance: 19.859 on 5 degrees of freedom
AIC: 34.087

Number of Fisher Scoring iterations: 4

Cheers,

Roman

--
--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: https://github.com/hadley/devtools/wiki/Reproducibility

To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

---
You received this message because you are subscribed to the Google Groups "ggplot2" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ggplot2+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ggplot2/1085ba50-038a-42e3-9a02-f95d4c7b83c8n%40googlegroups.com.