Morning all,
I have been fitting some occupancy (and other) models recently using covariates that are highly skewed / zero-inflated. This question is not really related to the unmarked package, but I feel the answer, which I am sure some of you will have, will be informative for unmarked analyses (apologies if I am wrong!).
I am trying to find some guidance on how to treat highly skewed / zero-inflated PREDICTORS in a model, say a landcover type where ~80% of the data are 0's and 20% are >0. It seems clear to me that the zeros add an enormous amount of noise that detracts from estimating a relationship of interest that exists in the non-zero part of the covariate space. My current thinking would be to model the zeros explicitly:
y_i = b_0 + b_1 * I(cov = 0) + b_2 * cov + e_i
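For what it's worth, a model like this can be fit directly with lm() using an indicator term. A minimal sketch, where the names (y, cov) and the simulated "truth" are placeholders for illustration, not anyone's actual data:

```r
## Hypothetical sketch: indicator-plus-slope ("hurdle-style") model via lm().
## cov is ~80% zeros; the zero class has its own mean offset (b1 = 2).
set.seed(42)
cov <- runif(200, 0, 10) * rbinom(200, 1, 0.2)      # zero-inflated covariate
y   <- 5 + 2 * (cov == 0) + 1.5 * cov + rnorm(200)  # b0 = 5, b1 = 2, b2 = 1.5
fm  <- lm(y ~ I(cov == 0) + cov)
coef(fm)  # (Intercept), I(cov == 0)TRUE, cov
```

The logical term I(cov == 0) gets coded 0/1 by lm(), so its coefficient is the shift in the mean for the zero class relative to the fitted line.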
Okay, so in short: I know that there are typically no distributional assumptions about the predictors in regression-type analyses. So I am wondering, even though the predictor has no assumed distribution, does a large amount of zero inflation affect the ability to detect an effect in the non-zero part of the data?
If anyone knows of some literature, or recommendations, or rules of thumb etc, relating to this I would be interested to see it.
Chris
Hi Marc,
Thanks for this. Yes, removing the zeros to feel out how they influence the parameter estimates, in an exploratory sense, also seems like a good idea.
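For anyone who wants to try that comparison, a minimal sketch; the data frame d and the simulated truth here are made up purely for illustration:

```r
## Hypothetical sketch of the exploratory idea: refit on the non-zero subset
## and compare the slope against the full-data fit.
set.seed(7)
d <- data.frame(cov = runif(300, 0, 10) * rbinom(300, 1, 0.2))  # ~80% zeros
d$y <- 40 + 2 * d$cov + rnorm(300, 0, 5)                        # true slope = 2
coef(lm(y ~ cov, data = d))                   # all data
coef(lm(y ~ cov, data = subset(d, cov > 0)))  # zeros removed
```

If the two slopes diverge noticeably, that is a hint that the zero class is doing something other than sitting on the continuous relationship.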
I will keep tabs on any responses, try to summarize them if there are enough and then send round an update.
Until then, happy unmarked’ing everyone!
Chris
Here is something I did recently to do some preliminary feeling-out of the issue (make of it what you will; it seems obvious that if I simulate from the 'hurdle' model it does better):
#------- start of code from previous email correspondence (figure omitted):
I wanted to see whether a regression analysis could recover the two ‘means’ and a single slope pertaining to the z=1 class of the discretization of the continuous variable.
If you are interested, you can execute the following code and see that the model can recover the parameters extremely well.
This makes me think that adding both versions (binary and continuous) of a covariate seems to be okay.
What is clear here is that IF the data are generated under the 'hurdle' process, the excess 0's screw things up when you treat the covariate as continuous. I didn't scale the covariate, but I can't see that making a difference?
***disclaimer*** This is a quick and dirty simulation – feel free to tell me I am wrong.

set.seed(1)
sims <- 1000
out <- array(NA, c(sims, 3, 3))
coefs <- c("(Intercept)", "bin", "cont")
ac <- adjustcolor(c(2, 3, 4), alpha.f = 0.3)
for (i in 1:sims) {
  bin  <- rep(c(0, 1), each = 100)  # binary data
  cont <- runif(200, 2, 20) * bin   # "zero-inflated" covariate
  mu   <- 20 + bin * 50 + 3 * cont
  y    <- rnorm(length(mu), mu, 10)
  fm1 <- lm(y ~ bin + cont)
  fm2 <- lm(y ~ cont)
  fm3 <- lm(y ~ bin)
  out[i, , 1] <- coef(fm1)[coefs]  # terms a model lacks come back as NA
  out[i, , 2] <- coef(fm2)[coefs]
  out[i, , 3] <- coef(fm3)[coefs]
  # print(i)  # progress, if wanted
}
par(mfrow = c(1, 3))
boxplot(out[, 1, ], col = ac, main = "Intercept (bin = 0). TRUTH = 20", las = 1)
abline(h = 20, col = 4, lwd = 2)
boxplot(out[, 2, ], col = ac, main = "Coefficient on bin. TRUTH = 50", las = 1)
abline(h = 50, col = 4, lwd = 2)
legend("topleft", legend = c("y ~ bin + cont", "y ~ cont", "y ~ bin"),
       pch = 15, cex = 1.5, col = ac, bg = "white", bty = "n")
boxplot(out[, 3, ], col = ac, main = "Slope for cont. TRUTH = 3", las = 1)
abline(h = 3, col = 4, lwd = 2)
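A numeric companion to the boxplots, in case anyone prefers a table: the same simulation condensed (sims cut to 200 here just for speed), summarized as the mean estimate per coefficient and model, so bias can be read straight off. NA appears wherever a model lacks the term.

```r
## Condensed re-run of the simulation above, summarized numerically.
set.seed(1)
sims <- 200
out <- array(NA, c(sims, 3, 3))
coefs <- c("(Intercept)", "bin", "cont")
for (i in 1:sims) {
  bin  <- rep(c(0, 1), each = 100)
  cont <- runif(200, 2, 20) * bin
  y    <- rnorm(200, 20 + bin * 50 + 3 * cont, 10)
  out[i, , 1] <- coef(lm(y ~ bin + cont))[coefs]
  out[i, , 2] <- coef(lm(y ~ cont))[coefs]
  out[i, , 3] <- coef(lm(y ~ bin))[coefs]
}
## rows: (Intercept), bin, cont; columns: the three fitted models
round(apply(out, c(2, 3), mean), 2)
```

The y ~ bin + cont column should sit near the truth (20, 50, 3), while the misspecified models show the inflated/absorbed estimates the boxplots display.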
--
mu <- 20 + 3 * cont
Hi Mathias,
Thanks for looking into this! You make a great point. However, I simulated the data with the following in mind:
· What if the expected value of the response when cont = 0 does NOT equal the intercept of the relationship fit to the continuous (non-zero) data?
Specifically, I was looking at the following cases (figure below):
Your version is the left figure, and your interpretation makes perfect sense regarding bias, precision, etc. For the particular situation I simulated these data for, which was a little unrelated to my original post but still relevant, I was specifically interested in the second two situations (which may or may not be realistic in general, but were realistic for my particular situation).
bin  <- c(rep(0, 1900), rep(1, 100))  # binary data
cont <- runif(2000, 2, 20) * bin      # "zero-inflated" covariate
mu1 <- 50 + 3 * cont                  # no jump at the threshold (left panel)
mu2 <- 50 + bin * 20 + 3 * cont       # positive jump (middle panel)
mu3 <- 50 + bin * -20 + 3 * cont      # negative jump (right panel)
y1 <- rnorm(length(mu1), mu1, 10)
y2 <- rnorm(length(mu2), mu2, 10)
y3 <- rnorm(length(mu3), mu3, 10)
par(mfrow = c(1, 3))
plot(y1 ~ cont, pch = 16)
abline(50, 3, col = 4, lwd = 2)                    # true line for non-zero data
points(y = 50, x = 0, col = 2, cex = 2, pch = 16)  # expected value at cont = 0
plot(y2 ~ cont, pch = 16)
abline(70, 3, col = 4, lwd = 2)
points(y = 50, x = 0, col = 2, cex = 2, pch = 16)
plot(y3 ~ cont, pch = 16)
abline(30, 3, col = 4, lwd = 2)
points(y = 50, x = 0, col = 2, cex = 2, pch = 16)

*** Mathias, I realize this is exactly the point you made in your email (the threshold effect and the model selection); I just wanted to clarify/confirm that the threshold data-generating process is what I was particularly interested in.
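To make the model-selection side of that point concrete, here is a quick sketch under the positive-jump (mu2) scenario from above: comparing, via AIC, the continuous-only model against the model that also carries the binary indicator.

```r
## Model-selection sketch for the threshold (mu2) scenario: with a jump at
## the threshold, the model carrying the binary indicator should be favoured.
set.seed(1)
bin  <- c(rep(0, 1900), rep(1, 100))
cont <- runif(2000, 2, 20) * bin
y2   <- rnorm(2000, 50 + bin * 20 + 3 * cont, 10)  # jump of +20 at threshold
AIC(lm(y2 ~ cont), lm(y2 ~ bin + cont))
```

Under the no-jump (mu1) scenario the same comparison should come out roughly even, which is one practical way to decide whether the zeros need their own term.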
From: unma...@googlegroups.com [mailto:unma...@googlegroups.com] On Behalf Of Mathias Tobler
Sent: Friday, February 12, 2016 12:15 PM
To: unmarked <unma...@googlegroups.com>