Mean imputation for missing covariates


Sara Williams

Jul 30, 2024, 8:36:20 PM
to spOccupancy and spAbundance users
Hi Jeff,

Thanks so much for creating this group!

I'm hoping this is a straightforward question (though one I've often found confusing!)

We're using stPGOcc() on a data set with lots of sites (~900 in this particular case), 10 primary sampling occasions (years), and 15 secondary sampling occasions within years, and we've ended up with many NAs (sites were often only sampled in two or three of the 10 possible years). For site covariates, we have several that do not vary across years (elevation, ruggedness, etc.) and several that do vary across years. In our data set, wherever there is an NA in y, there is also an NA in each of the site covariates that vary across years (covariate values were not collected when the site was not surveyed).

I know that we cannot have NAs in data.list$occ.covs, and when I do leave the NAs in there, the error message suggests mean imputation. To do the imputation, I've been unsure whether I should use an overall mean (across all sites and years per covariate), a row mean (the mean for a given site across all years in which the covariate value is available), or a column mean (the mean for a given year across all sites). Any thoughts on the best approach, and on whether it actually makes any difference given that these NAs match up with NAs in data.list$y? I should also mention I'm scaling the covariates within the occ.formula argument (not ahead of time).

Thanks very much - and I think congrats are in order!! (Just noticed the different email address and signature)

Take care,
Sara Williams

Jeffrey Doser

Jul 31, 2024, 6:35:05 AM
to Sara Williams, spOccupancy and spAbundance users
Hi Sara,

Thanks for the note and great question! I agree that this is a situation where that error message is a bit confusing. Before stPGOcc() runs the model, it creates a new data object that drops all site/year/visit combinations in "data$y" for which there is an NA value. Because the NA values in your site-level covariates line up with the site/year combinations where you don't have observed detection-nondetection data, the values you put in there won't influence the actual model fitting process (i.e., the parameter estimates won't change). However, for all multi-season models in spOccupancy, the model will predict and return occurrence probability (psi) and latent occurrence (z) for all combinations of site and year, regardless of whether any data from that combination were used to fit the model. So, in the "psi.samples" and "z.samples" components of the resulting object from "stPGOcc()", there will be predictions for the unsampled site/year combinations, and those inherently depend on the covariate values that you supply to "occ.covs". If you aren't particularly interested in "psi.samples" or "z.samples" at the non-sampled site/year combinations, then you can just replace the NA values in "occ.covs" with some non-NA value and make sure you don't interpret the estimates for those site/year combinations when you're summarizing the results from the model.
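The "don't interpret those site/year combinations" step can be sketched in base R. This is a minimal illustration, not spOccupancy code: the arrays are simulated stand-ins, and the dimension order assumed for "psi.samples" (MCMC sample x site x year) and for "y" (site x year x visit) should be checked against your own fitted object and data list.

```r
set.seed(1)
J <- 4; n.years <- 3; K <- 2   # hypothetical numbers of sites, years, visits

# Stand-in for data.list$y: site x year x visit, NA where no survey occurred.
y <- array(rbinom(J * n.years * K, 1, 0.5), dim = c(J, n.years, K))
y[1, 2, ] <- NA                # site 1 not surveyed in year 2
y[3, 1, ] <- NA                # site 3 not surveyed in year 1

# TRUE where a site/year has at least one non-missing visit.
sampled <- apply(y, c(1, 2), function(v) any(!is.na(v)))

# Stand-in for out$psi.samples (assumed: MCMC sample x site x year).
n.samples <- 100
psi.samples <- array(runif(n.samples * J * n.years),
                     dim = c(n.samples, J, n.years))

# Posterior mean per site/year, blanked out where nothing was surveyed
# so imputed-covariate predictions are not summarized by accident.
psi.mean <- apply(psi.samples, c(2, 3), mean)
psi.mean[!sampled] <- NA
```

Any later summary (for example, yearly averages with `colMeans(psi.mean, na.rm = TRUE)`) then uses only the site/year combinations that were actually surveyed.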

Now, in a situation where you did want the estimates at non-sampled site/year combinations (or where you had missing covariate values at site/years that were actually surveyed), it does become relevant how you impute the missing values. My perspective is that there isn't one way that works best in all situations; rather, it depends on the specific covariate that you're working with. I would just say you should use the approach that you think will most accurately represent the covariate at those locations (and of course acknowledge that you imputed the covariate value if you end up using predictions at that site/year combination). For example, if you have a site/year-level covariate with missing values in some years, it might be best to use the mean across that site (row mean) if there is very little change in the covariate over the years and you expect substantial correlation in the values at the site across different years. Alternatively, if you had some covariate like "time of day" as a survey-level covariate on detection probability and it wasn't always recorded, it could be better to use the overall mean value if the time the survey takes place is fairly random and not tied to the specific site. You should calculate the means using the values that you supply in "occ.covs" (i.e., in your case, calculate the means on the original scale of the covariate, since you're standardizing in the formula).
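The three options discussed in this thread (overall, row, and column mean) can be sketched in base R as follows. This is only an illustration under stated assumptions: `x` is a made-up site x year matrix standing in for one yearly varying element of "occ.covs", and all object names are hypothetical. The means are computed on the original, unscaled values, consistent with scaling inside the formula.

```r
# Hypothetical site x year covariate (3 sites, 4 years); NA = not surveyed.
x <- matrix(c(10, NA, 12, NA,
              NA, 20, 22, 24,
              30, 32, NA, NA),
            nrow = 3, byrow = TRUE)

# Row/column positions of every missing cell.
miss <- which(is.na(x), arr.ind = TRUE)

# Overall mean: the same value for every missing cell.
x.overall <- x
x.overall[miss] <- mean(x, na.rm = TRUE)

# Row (site) mean: each missing cell gets its own site's mean across years.
site.means <- rowMeans(x, na.rm = TRUE)
x.row <- x
x.row[miss] <- site.means[miss[, "row"]]

# Column (year) mean: each missing cell gets that year's mean across sites.
year.means <- colMeans(x, na.rm = TRUE)
x.col <- x
x.col[miss] <- year.means[miss[, "col"]]
```

Note that a site with no observed values in any year would get NaN from `rowMeans()`, so the row-mean version needs a fallback (for example, the overall mean) for such sites.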

Let me know if I can clarify any of that!

Cheers,

Jeff



--
Jeffrey W. Doser, Ph.D.
Assistant Professor
Department of Forestry and Environmental Resources
North Carolina State University
Pronouns: he/him/his

Sara Williams

Aug 2, 2024, 11:30:45 AM
to Jeffrey Doser, spOccupancy and spAbundance users

Hi Jeff,

Thanks so much for the detailed reply. Very helpful and much appreciated! (Btw, I didn't mean to say the error message was confusing - it was helpful to have the direction from it. I meant that which values to use for mean imputation has been the confusing part for me in the past.)

Thanks again!

Sara
