zero-inflated / highly skewed PREDICTORS


Chris Sutherland
Feb 7, 2016, 11:18:23 AM
to unmarked

Morning all,


I have been fitting some occupancy (and other) models recently using covariates that are highly skewed / zero-inflated. This question is not really related to the unmarked package itself, but I feel like the answer, which I am sure some of you will have, will be informative for unmarked analyses (apologies if I am wrong!).


I am trying to find some guidance on how to treat highly skewed / zero-inflated PREDICTORS in a model, say a landcover type where ~80% of the data are 0’s and 20% are >0. It seems clear to me that the zeros add an enormous amount of noise that detracts from estimating a relationship of interest that exists in the non-zero parameter space. My current thinking would be to model the zeros explicitly:

 

y_{i} = b_{0} + b_{1} * I(cov=0) + b_{2} * cov + e_{i}

i.e., an intercept for the zeros and a coefficient for the relationship in the non-zero space. I did a small simulation study to see whether a simple regression model can recover parameters from data generated under this model, and it all checks out, but is it the recommended thing to do? Imagine also if all 5 or 20 (or however many) covariates were like this.
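In lm() syntax that model is just the covariate plus its own zero indicator. A quick sketch on made-up data (the covariate name, effect sizes, and sample size are all invented for illustration):

```r
set.seed(1)
# hypothetical zero-inflated covariate: ~80% exact zeros, the rest on (2, 20)
cov  <- ifelse(runif(200) < 0.8, 0, runif(200, 2, 20))
zero <- as.numeric(cov == 0)                  # the indicator I(cov = 0)
y    <- 20 + 10 * zero + 3 * cov + rnorm(200, 0, 5)

fm <- lm(y ~ zero + cov)    # separate intercept for the zeros;
summary(fm)$coefficients    # the slope is driven by the non-zero range
```

Because `zero` absorbs the mean of the zero class, the slope on `cov` is estimated only from the non-zero observations.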

 

Okay, so in short: I know that there are typically no assumptions about the distribution of the predictors in regression-type analyses. So I am wondering, although the predictor has no assumed distribution, does a large amount of zero inflation influence the ability to detect an effect in the non-zero range?


If anyone knows of some literature, or recommendations, or rules of thumb etc, relating to this I would be interested to see it. 


Chris


Kery Marc
Feb 8, 2016, 3:37:16 AM
to unma...@googlegroups.com
Dear Chris,

Interesting question, and I think quite a common problem with landuse data in regression analyses such as occupancy models.

I don't know anything formal about how to treat this but your approach looks interesting to me. In a sense, you treat your landcover predictor as something intermediate between a categorical and a continuous explanatory variable. Sounds good to me.

The other (obvious) thing that comes to my mind when you worry that the preponderance of zeroes affects your ability to detect effects of a predictor in the nonzero range is to simply repeat the analysis with the zeros dropped. But this way you may end up with very few cases (because the zeroes will typically be in different cases for different predictors). Plus, the zeroes aren't missing values, they are zeroes, so throwing them out isn't really the right thing to do either (except for an informal exploratory analysis; that's what I suggest).
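For what it's worth, the informal drop-the-zeros check is a one-liner. A sketch on made-up data (`dat`, `y`, and `cov` are hypothetical names):

```r
set.seed(1)
# hypothetical data with a zero-inflated covariate
dat <- data.frame(cov = c(rep(0, 160), runif(40, 2, 20)))
dat$y <- 20 + 3 * dat$cov + rnorm(200, 0, 5)

fm_all     <- lm(y ~ cov, data = dat)
fm_nonzero <- lm(y ~ cov, data = subset(dat, cov > 0))  # zeros dropped
rbind(all = coef(fm_all), nonzero = coef(fm_nonzero))   # compare informally
```

As Marc says, this is exploratory only: a large shift in the slope between the two fits suggests the zeros are doing something, but the zeros are real data and should stay in the final model.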

So, in summary, I don't know a definite answer either ....

Best regards  --- Marc




Chris Sutherland
Feb 9, 2016, 9:14:43 AM
to unma...@googlegroups.com

Hi Marc,

 

Thanks for this. Yeah, removing the zeros to feel out how they influence the parameter estimates in an exploratory sense also seems like a good idea.

I will keep tabs on any responses, try to summarize them if there are enough and then send round an update.

 

Until then, happy unmarked’ing everyone!

 

Chris

Dan Linden
Feb 9, 2016, 12:13:52 PM
to unmarked
Somebody needs to write a book or monograph called "Gray Areas in Statistics".

Chris, this is definitely an interesting, common, and tricky problem.  From a linear model standpoint, a severe skew in the distribution of your covariate means that certain bins of values (if you were to discretize the measure) have large sample sizes while others have small ones.  So the problem is that your regression coefficients are potentially going to be poorly estimated, since only a few samples contribute to the slope estimates, and those samples have high leverage.

Some kind of transformation would make sense here, I think, and with so many zeros something like a square root might work.  Then your hypothesis is about the multiplicative effect of the covariate, which may or may not make sense.  But I like your covariate set-up too; it is similar to a hurdle model.
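One convenient property of the square root here is that it leaves the zeros at exactly zero (a log would need an arbitrary constant added first). A quick sketch on a made-up covariate, with a bare-hands skewness measure:

```r
set.seed(1)
cov <- c(rep(0, 1600), runif(400, 2, 20))  # hypothetical zero-inflated covariate
cov_sqrt <- sqrt(cov)                      # zeros stay exactly zero

skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skew(cov), sqrt = skew(cov_sqrt))  # skew is reduced but not removed
```

Note the spike at zero survives the transformation untouched; only the upper tail is compressed, and the slope is then interpreted per unit of sqrt(cov).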

laufenbe...@gmail.com
Feb 10, 2016, 6:34:27 AM
to unmarked
Hi Chris and others

I find this to be a timely post because today I encountered this issue with a student's question in a workshop I'm co-instructing, and I didn't have a clear answer.  Thus far, the student's approach was to standardize the covariate (i.e., % cover type) for an occupancy analysis in which the raw covariate values were mostly 0%. I'm not sure whether this is an appropriate approach, but I am interested in the consequences of taking it.  Perhaps simulating under this scenario could shed some light on its utility. I will share any insights that come out of working on this with the student, but I would like to hear opinions from others on standardization of "zero-inflated" covariates.
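One thing that is easy to verify is that standardizing is a purely linear rescaling: it moves and shrinks the distribution but does not change its shape, so the zero spike and the skew survive intact. A sketch on a made-up % cover variable:

```r
set.seed(1)
pcover <- c(rep(0, 160), runif(40, 5, 90))  # hypothetical % cover, mostly zeros
z <- as.numeric(scale(pcover))              # centre to mean 0, scale to sd 1

unique(z[pcover == 0])  # all the zeros collapse onto one (negative) value
cor(pcover, z)          # exactly 1: ordering and shape are unchanged
```

So standardizing helps numerically (e.g., for optimizer convergence in unmarked) but cannot address the zero inflation itself.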

Chris, perhaps cross-posting this question to the hmecology forum would prompt responses from a larger audience.

Jared

Chris Sutherland
Feb 10, 2016, 8:50:21 AM
to unma...@googlegroups.com

Here is something I did recently to do some preliminary feeling out of the issue (make of it what you will – it seems obvious that if I simulate from the ‘hurdle model’ it does better):

 

 

 

#------- code from previous email correspondence (below the figure):

 

I wanted to see whether a regression analysis could recover the two ‘means’ and a single slope pertaining to the z=1 class of the discretization of the continuous variable.

If you are interested, you can execute the following code and see that the model can recover the parameters extremely well.

This makes me think that adding both versions (binary and continuous) of a covariate seems to be okay.

 

What is clear here is that IF the data are generated under the ‘hurdle’ process, the excess 0’s screw things up when you treat the covariate as continuous – I didn’t scale the covariate, but I can’t see that making a difference.

 

***disclaimer*** This is a quick and dirty simulation – feel free to tell me I am wrong.

 

 

 

set.seed(1)
sims <- 1000
out <- array(NA, c(sims, 3, 3))
coefs <- c("(Intercept)", "bin", "cont")
ac <- adjustcolor(c(2, 3, 4), alpha.f = 0.3)

for (i in 1:sims) {
  bin  <- rep(c(0, 1), each = 100)  # binary indicator
  cont <- runif(200, 2, 20) * bin   # "zero-inflated" covariate
  mu   <- 20 + bin * 50 + 3 * cont
  y    <- rnorm(length(mu), mu, 10)

  fm1 <- lm(y ~ bin + cont)
  fm2 <- lm(y ~ cont)
  fm3 <- lm(y ~ bin)

  out[i, , 1] <- coef(fm1)[coefs]
  out[i, , 2] <- coef(fm2)[coefs]  # NA for "bin", which fm2 lacks
  out[i, , 3] <- coef(fm3)[coefs]  # NA for "cont", which fm3 lacks
}

par(mfrow = c(1, 3))
boxplot(out[, 1, ], col = ac, main = "Intercept for bin=0. TRUTH=20", las = 1)
abline(h = 20, col = 4, lwd = 2)
boxplot(out[, 2, ], col = ac, main = "Coefficient for bin. TRUTH=50", las = 1)
abline(h = 50, col = 4, lwd = 2)
legend("topleft", legend = c("y~bin+cont", "y~cont", "y~bin"), pch = 15,
       cex = 1.5, col = ac, bg = "white", bty = "n")
boxplot(out[, 3, ], col = ac, main = "Slope for cont. TRUTH=3", las = 1)
abline(h = 3, col = 4, lwd = 2)


Mathias Tobler
Feb 12, 2016, 12:15:07 PM
to unmarked
Hi Chris,

I took a quick look at your simulation. The cont variable is already zero-inflated, so it is not clear to me why you would also include the bin variable in the simulation. If you introduce an additional intercept for all non-zero values, it is obvious that the cont model won't give accurate results. If you change the simulation to use only cont,

  mu <- 20 + 3 * cont

it looks like the results are unbiased despite the large number of zeroes. The bin + cont model also produced unbiased estimates, but with lower precision for the slope. So I guess this answers your original question: as long as you have enough non-zero data, the additional zeroes don't seem to affect the slope estimates much. Now, I do understand that there could be processes that would create a threshold effect like the one in your simulation. You could create two models, one with bin and one without, and use model selection to decide which better fits the data.
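That comparison could look like this (a sketch reusing the simulation's quantities; here the data are deliberately generated WITH the threshold effect, so the richer model should be preferred):

```r
set.seed(1)
bin  <- rep(c(0, 1), each = 100)                  # zero / non-zero indicator
cont <- runif(200, 2, 20) * bin                   # zero-inflated covariate
y    <- rnorm(200, 20 + 50 * bin + 3 * cont, 10)  # threshold effect present

fm_cont <- lm(y ~ cont)
fm_both <- lm(y ~ bin + cont)
AIC(fm_cont, fm_both)  # bin + cont wins when the threshold is real
```

With data generated without the bin term, the same comparison should favour the cont-only model instead.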

Mathias

P.S.: I sent this message two days ago but somehow it did not get through. Probably related to our network here in the office.





Chris Sutherland
Feb 12, 2016, 1:03:40 PM
to unma...@googlegroups.com

Hi Mathias,

 

Thanks for looking into this! You make a great point. However, I simulated data with the following in mind:

 

- What if the expected value of the response when cont = 0 does NOT equal the intercept of the relationship in the continuous data?

 

Specifically, I was looking at the following cases (figure below):

Your version is the left figure, and your interpretation makes perfect sense regarding bias, precision, etc. For the particular situation I simulated these data for, which was a little unrelated to my original post but still relevant, I was interested specifically in the second two situations (which may or may not be realistic, but for my particular case they were):

 

bin  <- c(rep(0, 1900), rep(1, 100))  # binary indicator
cont <- runif(2000, 2, 20) * bin      # "zero-inflated" covariate

mu1 <- 50 + 3 * cont             # no threshold effect
mu2 <- 50 + bin * 20 + 3 * cont  # positive jump at cont > 0
mu3 <- 50 + bin * -20 + 3 * cont # negative jump at cont > 0
y1 <- rnorm(length(mu1), mu1, 10)
y2 <- rnorm(length(mu2), mu2, 10)
y3 <- rnorm(length(mu3), mu3, 10)

par(mfrow = c(1, 3))
plot(y1 ~ cont, pch = 16)
abline(50, 3, col = 4, lwd = 2)
points(y = 50, x = 0, col = 2, cex = 2, pch = 16)

plot(y2 ~ cont, pch = 16)
abline(70, 3, col = 4, lwd = 2)
points(y = 50, x = 0, col = 2, cex = 2, pch = 16)

plot(y3 ~ cont, pch = 16)
abline(30, 3, col = 4, lwd = 2)
points(y = 50, x = 0, col = 2, cex = 2, pch = 16)

 

 

[attached figures: the three simulated cases plotted by the code above]

Chris Sutherland
Feb 12, 2016, 1:07:59 PM
to unma...@googlegroups.com

*** Mathias, I realize that this is exactly the point you made in your email (the threshold effect and the model selection), but I just wanted to clarify/confirm that the threshold data-generating process is what I was particularly interested in.

 
