Wrong prediction output? Is it possible to predict occupancy when a predictor's category has 0 observations?

Gabriel Valdez

unread,

Jul 18, 2017, 11:46:47 AM7/18/17

to unmarked

Hello everyone !

I recently posted a question related to a difficulty understanding unmarked’s occupancy output when my categorical predictor (habitat) had one category with 0 observations.

https://groups.google.com/forum/#!topic/unmarked/ho6o8UIk-bo

Turns out I am not supposed to interpret parameters with huge SE. I’m okey with that, I assume that I can still say that, in my case, elephants are less present in reserves M and RW than in reserve R (reference category) thanks to my raw data.

Call:

occu(formula = ~Camera.type ~ Reserve, data = Eleph)

Occupancy:

                  Estimate       SE       z  P(>|z|)

(Intercept)         -1.69    0.335 -5.0408 4.63e-07

ReserveM            -17.46 1398.101 -0.0125 9.90e-01

ReserveRW            -1.68    0.657 -2.5543 1.06e-02

Detection:

                     Estimate    SE     z  P(>|z|)

(Intercept)            -2.418 0.277 -8.73 2.66e-18

Camera.typeCuddeback   -0.686 0.654 -1.05 2.95e-01

AIC: 320.9485

Raw data tells me that there are 0 observations at reserve M, 40 observations at reserve R and 7 observations at reserve RW.

My current best model (selected based on AIC) looks like this:

> best2=occu(~1~Altitude_s+Habitat+Dist.Water.Main.Secon_s+Dist.Main.Secondary.Roads_s+Slope_s+DistPark_s+Nom.Reserve,Eleph)

> best2

Call:

occu(formula = ~1 ~ Altitude_s + Habitat + Dist.Water.Main.Secon_s +

    Dist.Main.Secondary.Roads_s + Slope_s + Distpark_s + Nom.Reserve,

    data = Eleph)

Occupancy:

                            Estimate       SE        z P(>|z|)

(Intercept)                    0.851    1.506  0.56510  0.5720

Altitude_s                     0.813    0.598  1.36078  0.1736

Habitatclosed woodland         3.318    2.450  1.35439  0.1756

Habitatopen woodland          -1.712    1.216 -1.40760  0.1592

Habitatwooded grassland      -11.666  353.071 -0.03304  0.9736

Dist.Water.Main.Secon_s       -0.569    0.525 -1.08362  0.2785

Dist.Main.Secondary.Roads_s    2.087    0.723  2.88611  0.0039

Slope_s                        -0.969    0.628 -1.54347  0.1227

DistPark_s                     3.347    1.319  2.53709  0.0112

Nom.ReserveM                  -19.757 4843.165 -0.00408  0.9967

Nom.ReserveRW                  -10.016    3.630 -2.75878  0.0058

Detection:

 Estimate    SE     z  P(>|z|)

    -2.49 0.251 -9.92 3.47e-23

AIC: 292.7279

I would like to use this model to make a prediction map but before doing that I tried to predict my categorical variables when the other variables are kept constant. These are my predicted values for reserve:

> newData=data.frame(Nom.Reserve=c("M","R","RW"),DistPark_s=rep(0,3),Dist.Water.Main.Secon_s=c(0,0,0),Altitude_s=c(0,0,0),Dist.Main.Secondary.Roads_s=rep(0,3),Pente_s=rep(0,3))

> pred=predict(best3,type="state",newdata=newData,appendData=T)

> pred

Predicted	SE	lower	upper	Nom.Reserve	Distpark_s	Dist.Water.Main.Secon_s
4.24E-01	2.73E-01	7.60E-02	0.86777228	M	0	0
1.48E-07	8.23E-05	0.00E+00	1	R	0	0
5.99E-05	1.43E-04	5.64E-07	0.00632038	RW	0	0

In this prediction, reserve M has the highest occupancy eventhough we saw 0 elephants there and the other 2 reserves have a close to 0 occupancy. This has to be wrong! Even if I model reserve alone, I get similar results.

Does this mean that it is impossible to predict scores when we have too few observations in our categories?

Why the model tells me that occupancy decreases (negative estimate) when we pass from reserve R (reference category) to RW and then it predicts the opposite?? I have observed the same phenomenon for other species. Example with Panthera Pardus:

> best3=occu(~Camera.type~Dist.Water.Main.Secon_s+Nom.Reserve,Pardus)

> best3

Call:

occu(formula = ~Camera.type ~ Dist.Water.Main.Secon_s + Nom.Reserve,

    data = Pardus)

Occupancy:

                        Estimate    SE     z P(>|z|)

(Intercept)               -2.065 0.644 -3.21 0.00134

Dist.Water.Main.Secon_s   -0.883 0.347 -2.54 0.01101

Nom.ReserveM               2.031 0.878  2.31 0.02068

Nom.ReserveR               1.264 0.768  1.64 0.10007

Detection:

                     Estimate    SE    z  P(>|z|)

(Intercept)            -3.896 0.410 -9.5 2.09e-21

Camera.typeCuddeback   -0.705 0.392 -1.8 7.20e-02

AIC: 453.7909

> newData=data.frame(Nom.Reserve=c("M","R","RW"),Dist.Water.Main.Secon_s=c(0,0,0))

> pred=predict(best3,type="state",newdata=newData,appendData=T)

> pred

Predicted	SE	lower	upper	Nom.Reserve	Dist.Water.Main.Secon_s
0.1125661	0.0643181	0.03466546	0.3094148	M	0
0.4916469	0.1997519	0.16799854	0.8224519	R	0
0.3097557	0.1496749	0.10217138	0.6389486	RW	0

For this case I would have assumed that Reserve M had a higher occupancy than reference reserve R. When I use predict, it looks like the opposite. This reasoning works for other species.

I only get these confusing results for species that were observed less than 50 times across 440 sites (21 days survey). Also, I only have problems with categorical predictors (habitat too).

What should I do?

Thanks in advance!

Gabriel Valdez

Kery Marc

unread,

Jul 18, 2017, 12:06:32 PM7/18/17

to unma...@googlegroups.com

Dear Gabriel,

prediction in regression models can be tricky, especialy when there are correlations among the explanatory variables or when not all combinations of covariate values are observed in nature. For instance, it is easy to obtain predictions that do not make sense, even though they are mathematically correct. But the problem is that they may represent an extrapolation into some part of the covariate space that does not exist in practice.

In your model, you predict for the three reserves and keep the values of all the continuous covariates at the mean observed value 0 (assuming that you normalized these covariates). It is easy then that you get predictions for situations that do not exist in practice, because the values of these continuous covariates will NOT be at the observed mean the three reserves. What you do now is predict for a site that lies in each of these three reserves AND has average values of all the continuous covariates in the model. But presumably that combination of covariate values does not exist anywhere in your study area.

If you want to produce a map, then you must supply in your new data frame the values of each covariate for every pixel in the landscape for which you want to predict.

Best regards --- Marc

From: unma...@googlegroups.com [unma...@googlegroups.com] on behalf of Gabriel Valdez [gabrie...@hotmail.com]
Sent: 18 July 2017 17:46
To: unmarked
Subject: [unmarked] Wrong prediction output? Is it possible to predict occupancy when a predictor's category has 0 observations?

--
You received this message because you are subscribed to the Google Groups "unmarked" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gabriel Valdez

unread,

Jul 19, 2017, 5:05:51 AM7/19/17

to unmarked

Dear Marc,

Thank you for your fast reply once again.

I did as you said and used my model to predict occupancy across every pixel of my study area (using ArcGIS). However, the problem persists.

> best2=occu(~1~Altitude_s+Habitat+Dist.Water.Main.Secon_s+Dist.Main.Secondary.Roads_s+Slope_s+Nom.Reserve,Eleph)

> best2

Call:

occu(formula = ~1 ~ Altitude_s + Habitat + Dist.Water.Main.Secon_s +

    Dist.Main.Secondary.Roads_s + Slope_s + Nom.Reserve, data = Eleph)

Occupancy:

                            Estimate      SE       z P(>|z|)

(Intercept)                   -3.545   1.088 -3.2576 0.00112

Altitude_s                     0.857   0.556  1.5400 0.12357

Habitatclosed woodland         3.997   1.609  2.4845 0.01297

Habitatopen grassland          2.149   1.304  1.6483 0.09929

Habitatwooded grassland      -10.073 294.202 -0.0342 0.97269

Dist.Water.Main.Secon_s       -0.561   0.497 -1.1287 0.25901

Dist.Main.Secondary.Roads_s    2.427   0.798  3.0437 0.00234

Slope_s                       -0.843   0.469 -1.7979 0.07219

Nom.ReserveM                  -13.494 392.619 -0.0344 0.97258

Nom.ReserveRW                 -3.308   1.731 -1.9110 0.05601

Detection:

 Estimate    SE     z  P(>|z|)

     -2.5 0.253 -9.88 4.93e-23

AIC: 300.0922

> Eleph.Predmap=predict(best2, type="state", newdata=Predmap, appendData=T)

> aggregate(Eleph.Predmap[,1:4], list(Eleph.Predmap$Nom.Reserve), mean)

  Group.1    Predicted           SE       lower     upper

1   M       8.329646e-02 0.0913347325 0.007149494 0.4972339

2   R       2.340644e-06 0.0009189598 0.000000000 1.0000000

3  RW       6.753009e-02 0.0592211875 0.007940877 0.2765730

I still get a lower occupancy prediction in the reserve where I observed most elephants.

I controlled for multicollinearity using VIF and it seems to me to be okey (no VIF higher than 5).

> vif(lm(Model_ss$Eleph~Model_ss$Altitude_s+Model_ss$Habitat+Model_ss$Dist.Water.Main_s+Model_ss$Dist.Main.Secondary.Roads_s+Model_ss$Slope_s+Model_ss$Nom.Reserve))

                                         GVIF Df GVIF^(1/(2*Df))

Model_ss$Altitude_s                  2.980656  1        1.726458

Model_ss$Habitat                     1.259629  3        1.039219

Model_ss$Dist.Water.Main_s           1.085870  1        1.042051

Model_ss$Dist.Main.Secondary.Roads_s 1.714767  1        1.309491

Model_ss$Slope_s                     1.644706  1        1.282461

Model_ss$Nom.Reserve                 4.757057  2        1.476844

Also, in the model we can see that parameter ReserveRW has a negative estimate. Shouldn’t this mean that reserve RW has a lower “occupancy” than R (reference category)?

I really don't know if I should throw my model to the garbage or not.

Thanks again!

Gabriel

Gabriel Valdez

unread,

Jul 19, 2017, 5:09:57 AM7/19/17

to unmarked

This is what my dataset containing all covariates by pixel looks like:

> head(Predmap)
  FID Join_Count       Habitat            Water_FC Dist_water Dist_Villa                  Villa_FC Dist_culti
1   0          1 open woodland Water\\Rivers major   11322.00   23331.37 Infrastructures\\Villages   16042.56
2   1          1 open woodland Water\\Rivers major   10975.45   23324.18 Infrastructures\\Villages   16294.47
3   2          1 open woodland Water\\Rivers major   11346.64   23824.22 Infrastructures\\Villages   16725.02
4   3          1 open woodland Water\\Rivers major   11727.41   24324.27 Infrastructures\\Villages   17159.34
5   4          1 open woodland Water\\Rivers major   12116.86   24824.31 Infrastructures\\Villages   17597.15
6   5          1 open woodland Water\\Rivers major   12514.17   25324.36 Infrastructures\\Villages   18038.20
  Dist_Park Dist_GameR Dist_road Dist_escar     Slope  Altitude Nom.Reserve DistWatMS            WatMS_Fc
1   33390.10   34129.73  791.6553   8708.317 0.7220203 1291.973       M      11322.00 Water\\Rivers major
2   32890.93   33635.16 1268.3798   8869.866 0.8075468 1288.703       M      10975.45 Water\\Rivers major
3   32864.96   33564.47 1107.7780   8402.630 0.5225660 1290.012       M      11346.64 Water\\Rivers major
4   32846.59   33501.09  943.6142   7939.391 1.1464645 1294.530       M      11727.41 Water\\Rivers major
5   32835.82   33445.07  779.4499   7480.893 1.3533006 1302.229       M      12116.86 Water\\Rivers major
6   32832.67   33396.44  615.2854   7028.064 1.3432235 1310.627       M      12514.17 Water\\Rivers major
  Dist.Water.Main_s DistVillage_s Dist.cultivation_s  DistPark_s Dist.Game.Reserve_s Dist.Main.Secondary.Roads_s
1         0.8522358   -0.27133053         -0.6373598   0.06679590            2.130527                   -1.986802
2         0.8016014   -0.27203736         -0.6094975   0.04647859            2.106300                   -1.815373
3         0.8558062   -0.22313092         -0.5623724   0.04541753            2.102822                   -1.868811
4         0.9104960   -0.17473497         -0.5154452   0.04466649            2.099702                   -1.927613
5         0.9655208   -0.12683387         -0.4687381   0.04422625            2.096941                   -1.991794
6         1.0207530   -0.07941273         -0.4222696   0.04409727            2.094542                   -2.063174
  Dist.Escarpement_s     Slope_s Altitude_s Dist.Water.Main.Secon_s
1          0.7735670 -0.15394253  0.3302090                1.671498
2          0.7980627 -0.08783263  0.3092900                1.612651
3          0.7265855 -0.32534568  0.3176652                1.675647
4          0.6537298  0.14474215  0.3465511                1.739207
5          0.5794949  0.26986292  0.3956533                1.803156
6          0.5039094  0.26399864  0.4490529                1.867346

<w

Kery Marc

unread,

Jul 19, 2017, 6:34:23 AM7/19/17

to unmarked

Dear Gabriel,

I am a little puzzled, but from looking at your pardus results reported earlier (and somewhere at the bottom of this mail), I get this hunch: I think that in R, the levels of a factor must appear in exactly the same order in the factor that you use to fit the model and in the predict statement. Look at your pardus results again:

> best3=occu(~Camera.type~Dist.Water.Main.Secon_s+Nom.Reserve,Pardus)

> best3

Call:

occu(formula = ~Camera.type ~ Dist.Water.Main.Secon_s + Nom.Reserve,

    data = Pardus)

Occupancy:

                        Estimate    SE     z P(>|z|)

(Intercept)               -2.065 0.644 -3.21 0.00134

Dist.Water.Main.Secon_s   -0.883 0.347 -2.54 0.01101

Nom.ReserveM               2.031 0.878  2.31 0.02068

Nom.ReserveR               1.264 0.768  1.64 0.10007

[.....]

You have different orders of the factor levels. In the fitted model (i.e., in the factor ‘Nom.Reserve’ used to fit the model), the intercept and hence the first (reference) level seems to be RW and then the 2^nd and 3^rd levels are M and R. In contrast, the ordering of the factor levels in ‘Nom.Reserve’ in the predict command is M, R, RW. It seems that predict produces predictions of the reserved IN THE SAME order as in the fitted model, i.e., first comes RW, then M and finally R. Look at this:

> plogis(-2.065)

[1] 0.1125455

This is essentially identical for the prediction for the first level (RW), which, however, there is erroneously labelled M.

Next, the second prediction is that for the 2^nd level IN THE FITTED MODEL (i.e., M)

> plogis(-2.065 + 2.031)

[1] 0.4915008

And, finally, this seems to be the prediction for level R.

> plogis(-2.065 + 1.264)

[1] 0.3098116

Hence, I would try to list the names of the reserves in the Nom.Reserve factor in the predict command in the same order as in the factor used to fit the model and I guess things come out right.

Best regards --- Marc

Gabriel Valdez

unread,

Jul 19, 2017, 8:32:49 AM7/19/17

to unmarked

Thank you Marc,

I'm really really sorry but your assumption comes from one of my dumb mistakes. I manually abbreviated the reserve names of the output and forgot to add a W to the second reserve. R is still the reference.

This is the correct output:

Call:

occu(formula = ~Camera.type ~ Dist.Water.Main.Secon_s + Nom.Reserve,

    data = Pardus)

Occupancy:

                        Estimate    SE     z P(>|z|)

(Intercept)               -2.065 0.644 -3.21 0.00134

Dist.Water.Main.Secon_s   -0.883 0.347 -2.54 0.01101

Nom.ReserveM               2.031 0.878  2.31 0.02068

Nom.ReserveRW              1.264 0.768  1.64 0.10007

and therefore the prediction is

  Predicted        SE      lower     upper Nom.Reserve Dist.Water.Main.Secon_s
1 0.4916469 0.1997519 0.16799854 0.8224519       R                           0
2 0.1125661 0.0643181 0.03466546 0.3094148       M                           0
3 0.3097557 0.1496749 0.10217138 0.6389486      RW                           0

Even if I change the order in the predict command it won't change the result.

Very sorry for this.

best regards --Gabriel

Kery Marc

unread,

Jul 19, 2017, 8:39:16 AM7/19/17

to unmarked

OK, that beats me then. But I think the fact that predict seems to give the right predictions in the wrong order seems significant. Anybody else any ideas ?

Chris Merkord

unread,

Jul 19, 2017, 9:40:41 AM7/19/17

to unmarked

Just to be clear... It doesn't matter what order the rows are in in your new data frame. It matters what order your factor levels are in, i.e levels(new_data$Nom.Reserve) should equal levels(original_data$Nom.Reserve). Sorry I don't remember your data frame names.

Cheers,
Chris

Gabriel Valdez

unread,

Jul 19, 2017, 10:11:17 AM7/19/17

to unmarked

It worked!!!!!! Thanks a million times!

What i needed to do was to relevel the newData$Nom.Reserve just as i did in the original data before doing the prediction.

newData$Nom.Reserve=relevel(newData$Nom.Reserve,ref="Rukwa")

Marc was right, it was the right prediction but in the wrong order.

I've being struggling with these weird results for days.

You both are my heros, thank you very much.

Jim Baldwin

unread,

Jul 19, 2017, 11:04:29 AM7/19/17

to unma...@googlegroups.com

I just wanted to follow up on your comment "Turns out I am not supposed to interpret parameters with huge SE." That's an incorrect interpretation of what Marc Kery told you. He told you exactly what what the issue is.

The maximum likelihood estimator for that probability is 0 (which translates to minus infinity in the logit scale) and is on the border of the parameter space. (Same thing when the probability is estimated to be 1). In such cases one cannot (using maximum likelihood) estimate the associated estimate of precision. That's just a fact of life with maximum likelihood estimation.

Because unmarked uses iterative methods to find the maximum likelihood estimates, the estimation procedure stops when the function being optimized stops changing by a preset amount. (It's a little more complicated than that but close.) For your data it stopped when the estimate for ReserveM was -17.46 which (as Marc pointed out) corresponds essentially to a probability of zero just as it would be if minus infinity was given as the estimate on the logit scale.

So (to finally make the point) it is the parameter estimate being very negative is what is important and not the size of the reported standard error.

OK. End of sermon.

Jim

--

You received this message because you are subscribed to the Google Groups "unmarked" group.

To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+unsubscribe@googlegroups.com.

Kery Marc

unread,

Jul 19, 2017, 11:14:53 AM7/19/17

to unma...@googlegroups.com

Thanks, Jim, for the clarifications !

Best regards --- Marc

Von: unma...@googlegroups.com [mailto:unma...@googlegroups.com] Im Auftrag von Jim Baldwin
Gesendet: Mittwoch, 19. Juli 2017 17:04
An: unma...@googlegroups.com
Betreff: Re: [unmarked] Wrong prediction output? Is it possible to predict occupancy when a predictor's category has 0 observations?

I just wanted to follow up on your comment "Turns out I am not supposed to interpret parameters with huge SE." That's an incorrect interpretation of what Marc Kery told you. He told you exactly what what the issue is.

--

To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+u...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--

You received this message because you are subscribed to the Google Groups "unmarked" group.

To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+u...@googlegroups.com.

Gabriel Valdez

unread,

Jul 20, 2017, 2:50:48 AM7/20/17

to unmarked

That helps me a lot. You were right to clarify this.
Thanks a lot.
Gabriel

Reply all

Reply to author

Forward