correcting for sampling bias with biased prior and converting to probability of presence

233 views
Skip to first unread message

Pascal Title

unread,
Apr 3, 2017, 10:19:28 PM4/3/17
to Maxent
Hi,

I am interested in accounting for sampling bias by using the biased prior approach. As explained in Merow et al. 2013, "target group sampling locations are provided to MaxEnt as if they represented a single species and the resulting prediction estimates sampling effort. Predictors might include distance to urban centers or roads, elevation, or topographic roughness (Phillips et al. 2009)."
From looking at the associated supplementary material, and again in the supplementary material of Merow et al. 2016, my understanding is that the raw output of the bias model described above is then used as a bias grid in maxent for modeling the species of interest. 
Then the raw output of that focal model is normalized to ensure that the raw output sums to 1.

I'm hoping someone can validate what I am describing here, as I have not been able to find any study that describes this:

I am using maxent via dismo in R (and currently transitioning to the new maxnet R package). For both of these R implementations of Maxent, I can't pass a bias grid. Therefore, I believe that I need to sample pseudo-absences myself, using the probabilities of the bias grid cells to sample locations for those pseudo-absences. So if I wanted to sample 10000 points with probability equal to the values of the cells, I would do:

bg.coords <- xyFromCell(bias, sample(ncell(bias), 10000, prob=values(bias)))

... where `bias` is the name of the raw output bias model raster.

As described, I then build a model with maxent for my focal species, using these selected points as pseudo-absences, take the raw output and then normalize it. By looking at the source code in the maxnet R package, I have defined the following function that would convert raw (exponential) format to cloglog, given the model object and the raw output (which I have normalized):

raw2cloglog <- function(mod, raw) {
 
return(1 - exp(0 - exp(mod$entropy + log(raw[,1]))))
}



Does this sound appropriate? If nothing else this might be useful to others who are interested in doing the same thing...

Thanks!
-Pascal

Pascal Title

unread,
Apr 3, 2017, 11:41:09 PM4/3/17
to Maxent
One aspect I'm not clear on is whether or not you need to divide your raw focal species model output by the raw bias model output, to factor it out before normalizing the result. 

Pascal Title

unread,
Apr 4, 2017, 3:03:01 PM4/4/17
to Maxent
I have created an example of what I am trying to do. It differs a bit from what I initially described, as I figured some parts out while working on this. 

Here, for a species of Australian gecko, I first created a bias grid (raw format), which I built from target group sampling (all gecko species in the same family) and rasters representing distance from roads, and distance from towns/cities. 

I then sampled 10000 pseudo-absences from this grid. 

I produced a maxnet model, and generated the raw output, which I then normalized to ensure that it summed to 1. 

I then factored the bias layer out by dividing my species model by the bias grid, and then normalized the result. 

As I want probability of presence, I then converted the new raw values to cloglog. 

What I am seeing in the result is that regions that have large values in the bias grid have essentially been removed from my model output. Conceptually, I understand that in some way this makes sense as we want to de-emphasize regions that have been heavily sampled relative to regions that have not been well sampled. But this seems to produce a worse model than if I were to not try to remove sampling bias in this way.

Is this right, and I just have data that are not well suited to help remove sampling bias? Or am I misunderstanding some of the steps?

Any help or suggestions would be greatly appreciated! Everything needed to reproduce my example can be downloaded here

Thanks,
-Pascal

Jamie M. Kass

unread,
Apr 5, 2017, 10:48:37 AM4/5/17
to Maxent
Hey Pascal,

If you sample background points using the probability from a raster that estimates the level of sampling bias, you should be able to simply insert these as the background points and run your model without any raster arithmetic afterwards. You shouldn't need to make any conversions afterwards. This is the target sampling group approach.

Not really familiar with what Maxent is doing when you give the GUI a bias file (and I can verify there is a regular dearth of info online about this), but it seems that the only way to do it is through the GUI currently. Looking back on some old posts here, it seems you may be able to specify it through the "args" command in dismo::maxent, but it needs to be in SWD format. Haven't tested it myself though.

Jamie Kass
PhD Candidate
City College of NYC

Pascal Title

unread,
Apr 5, 2017, 12:00:12 PM4/5/17
to Maxent
What you describe is the biased background approach, but I am describing and attempting to implement the biased prior approach. There are some figures in the supplement of Merow et al. 2013. For example, figure E3 shows that in their example, the biased prior approach generates a model that is closer to the known true distribution of the species than the biased background approach. This is further explained in Merow et al. 2016, especially in the supplement of that paper. 

So, we are talking about two related but different approaches to accounting for sampling bias. 

Pascal Title

unread,
Apr 5, 2017, 4:29:22 PM4/5/17
to Maxent
I've decided to work on the implementation using the example dataset employed by Merow et al. 2013 and 2016. This way, we know that we expect to generate an improved model by incorporating a biased prior.

The example dataset can be found here

I get a totally wrong result, but I still cannot figure out where I am going wrong. Hopefully someone is familiar enough with the method to give me some advice!

Thanks,
-Pascal

# Test of biased prior approach to sampling bias with Merow et al. 2016 example dataset

require(raster)
require(maxnet)
require(bossMaps)

dir
<- '~/Merow_et_al_2016_GEB_Minxent_Examples/Appendix_Data'

# function to normalize a raster (make it sum to unity
norm
<- function(surf) {surf / sum(surf, na.rm = TRUE)}

envFiles
<- list.files(paste0(dir, '/App1_Sampling_Bias/NE_ASCII'), pattern='.asc', full.names=TRUE)
env
<- stack(envFiles)

bioEnv
<- env[[grep('bio', names(env), value=TRUE)]]
anthropo
<- env[[c('roads_max', 'pop_max')]]

occ
<- read.csv(paste0(dir, '/App1_Sampling_Bias/CEOR_IPANE_5min_nodup.csv'), stringsAsFactors=FALSE)[,2:3]
targetocc
<- read.csv(paste0(dir, '/App1_Sampling_Bias/Bias_IPANE_allPoints.csv'), stringsAsFactors=FALSE)[,2:3]

###############################
# Create sampling bias model in raw format
bg
.coords <- sampleRandom(anthropo[[1]], 5000, xy=TRUE)[,1:2]
DFenvAbs <- extract(anthropo, bg.coords)
DFenvPres <- extract(anthropo, targetocc)
DFenvAbs <- DFenvAbs[complete.cases(DFenvAbs),]
DFenvPres <- DFenvPres[complete.cases(DFenvPres),]
bin
<- c(rep(1, nrow(DFenvPres)), rep(0, nrow(DFenvAbs)))
DFenv <- as.data.frame(rbind(DFenvPres, DFenvAbs))

projDF
<- as.data.frame(rasterToPoints(anthropo))
projDF
<- projDF[complete.cases(projDF),]
cellsWithData
<- cellFromXY(bioEnv, projDF[, 1:2])

biasmod
<- maxnet(bin, DFenv, f=maxnet.formula(p=bin, data=DFenv, classes='lqp'))
biaspx
<- predict(biasmod, projDF, type='exponential')[,1]
biaspx
<- norm(biaspx)
biaspx
<- rasterFromXYZ(cbind(allPts[,1:2], biaspx), crs = projection(anthropo))

plot
(biaspx, col=hcols(100, bias=1))

##############################################
# Create focal species model for comparison
bg
.coords <- sampleRandom(bioEnv[[1]], 5000, xy=TRUE)[,1:2]
DFenvAbs <- extract(bioEnv, bg.coords)
DFenvPres <- extract(bioEnv, occ)
DFenvAbs <- DFenvAbs[complete.cases(DFenvAbs),]
DFenvPres <- DFenvPres[complete.cases(DFenvPres),]
bin
<- c(rep(1, nrow(DFenvPres)), rep(0, nrow(DFenvAbs)))
DFenv <- as.data.frame(rbind(DFenvPres, DFenvAbs))

projDF
<- as.data.frame(rasterToPoints(bioEnv))
projDF
<- projDF[complete.cases(projDF),]
cellsWithData
<- cellFromXY(bioEnv, projDF[, 1:2])

noBias
<- maxnet(bin, DFenv, f=maxnet.formula(p=bin, data=DFenv, classes='lqp'))
noBiaspx
<- predict(noBias, projDF, type='exponential')[,1]
noBiaspx
<- norm(noBiaspx)
noBiaspx
<- rasterFromXYZ(cbind(projDF[,1:2], noBiaspx), crs = projection(bioEnv))

noBiaspx_cloglog
<- predict(noBias, projDF, type='cloglog')[,1]
noBiaspx_cloglog
<- rasterFromXYZ(cbind(projDF[,1:2], noBiaspx_cloglog), crs = projection(bioEnv))

################################################
# Create focal species model with biased prior
biaspx
[is.na(biaspx)] <- 0

bg
.coords <- xyFromCell(biaspx, sample(ncell(biaspx), 10000, prob=values(biaspx), replace = TRUE))

biaspx
[biaspx == 0] <- NA

# plot(biaspx)
# points(bg.coords, cex=0.05)

DFenvAbs <- extract(bioEnv, bg.coords)
DFenvPres <- extract(bioEnv, occ)
DFenvAbs <- DFenvAbs[complete.cases(DFenvAbs),]
DFenvPres <- DFenvPres[complete.cases(DFenvPres),]
bin
<- c(rep(1, nrow(DFenvPres)), rep(0, nrow(DFenvAbs)))
DFenv <- as.data.frame(rbind(DFenvPres, DFenvAbs))

projDF
<- as.data.frame(rasterToPoints(bioEnv))
projDF
<- projDF[complete.cases(projDF),]
cellsWithData
<- cellFromXY(bioEnv, projDF[, 1:2])

mod
<- maxnet(bin, DFenv, regmult=1, f=maxnet.formula(p=bin, data=DFenv, classes='lqp'))

# generate raw/exponential output
px
<- predict(mod, projDF, type='exponential')[,1]

# normalize
px
<- norm(px)
 
# factor out the bias grid
# This step does not seem to have been done in the code example
# in the appendix of Merow et al. 2016, but I think you are supposed to
px
<- px / values(biaspx)[cellsWithData]

# normalize the result so that it once again sums to unity
px
<- norm(px)
px_raw
<- rasterFromXYZ(cbind(projDF[,1:2], px))

# produce cloglog output
entrop
<- -sum(px * log(px), na.rm=TRUE)
px
<- 1 - exp(0 - exp(entrop + log(px)))
px_cloglog
<- rasterFromXYZ(cbind(projDF[,1:2], px))


par
(mfrow=c(1,3))
plot
(biaspx, col=hcols(100, bias=1), box=FALSE, axes=FALSE)
title
('bias grid')

plot
(noBiaspx, col=hcols(100, bias=2), box=FALSE, axes=FALSE)
title
('no bias correction')

plot
(px_raw, col=hcols(100, bias=1), box=FALSE, axes=FALSE)
title
('with bias correction')




Pascal Title

unread,
Apr 5, 2017, 4:33:59 PM4/5/17
to Maxent
Looks like my code may have gotten cut off, so I've attached my R script this time. 

noBiaspx
<- rasterFromXYZ(cbind(projDF[,1:2], noBiaspx), crs </span
maxnetTest.R

Pascal Title

unread,
Apr 5, 2017, 5:13:17 PM4/5/17
to Maxent
Sorry, found a small typo, attached new version.
maxnetTest.R

Pascal Title

unread,
Apr 11, 2017, 10:12:37 AM4/11/17
to Maxent
To follow up on this and narrow my question:

In the example provided in the Merow et al. 2016 supplement, they use a bias grid (to sample the pseudo-absences), and then they normalize the resulting raw maxent output, and that's how they get their bias-corrected result.

But in the text of the 2016 GEB paper supplement, line 811, it's written that:
"The offset is provided to Maxent in ASCII format (the same format as environmental layers) as a Bias file (under the Advanced tab in Settings). Once the model is run, invoking Minxent simply involves inverting eqn. 5 by multiplying the offset back into the raw output provided by the software, and normalizing."

Again, in the comments on ENMeval, Merow talks about "dividing the raw value by the bias value" to factor out the bias. 

This is where I am confused... where is this post-model-creation manipulation where you multiply the offset back into the raw output? If I divide the raw output of my focal species model by the bias grid, and normalize the result, I basically get holes wherever there was high sampling as per the bias grid, so that doesn't seem to be right. 

I'm having a hard time reconciling what's written with what the provided code is doing. I must be misunderstanding some aspect of this. 
Reply all
Reply to author
Forward
0 new messages