Wrong prediction output? Is it possible to predict occupancy when a predictor's category has 0 observations?

228 views
Skip to first unread message

Gabriel Valdez

unread,
Jul 18, 2017, 11:46:47 AM7/18/17
to unmarked

Hello everyone !


I recently posted a question related to a difficulty understanding unmarked’s occupancy output when my categorical predictor (habitat) had one category with 0 observations.

https://groups.google.com/forum/#!topic/unmarked/ho6o8UIk-bo

 

Turns out I am not supposed to interpret parameters with huge SE. I’m okey with that, I assume that I can still say that, in my case, elephants are less present in reserves M and RW than in reserve R (reference category) thanks to my raw data.

Call:
occu(formula = ~Camera.type ~ Reserve, data = Eleph)
 
Occupancy:
                  Estimate       SE       z  P(>|z|)
(Intercept)         -1.69    0.335 -5.0408 4.63e-07
ReserveM            -17.46 1398.101 -0.0125 9.90e-01
ReserveRW            -1.68    0.657 -2.5543 1.06e-02
 
Detection:
                     Estimate    SE     z  P(>|z|)
(Intercept)            -2.418 0.277 -8.73 2.66e-18
Camera.typeCuddeback   -0.686 0.654 -1.05 2.95e-01
 
AIC: 320.9485 

 

Raw data tells me that there are 0 observations at reserve M, 40 observations at reserve R and 7 observations at reserve RW.


My current best model (selected based on AIC) looks like this:

> best2=occu(~1~Altitude_s+Habitat+Dist.Water.Main.Secon_s+Dist.Main.Secondary.Roads_s+Slope_s+DistPark_s+Nom.Reserve,Eleph) 
> best2
 
Call:
occu(formula = ~1 ~ Altitude_s + Habitat + Dist.Water.Main.Secon_s + 
    Dist.Main.Secondary.Roads_s + Slope_s + Distpark_s + Nom.Reserve, 
    data = Eleph)
 
Occupancy:
                            Estimate       SE        z P(>|z|)
(Intercept)                    0.851    1.506  0.56510  0.5720
Altitude_s                     0.813    0.598  1.36078  0.1736
Habitatclosed woodland         3.318    2.450  1.35439  0.1756
Habitatopen woodland          -1.712    1.216 -1.40760  0.1592
Habitatwooded grassland      -11.666  353.071 -0.03304  0.9736
Dist.Water.Main.Secon_s       -0.569    0.525 -1.08362  0.2785
Dist.Main.Secondary.Roads_s    2.087    0.723  2.88611  0.0039
Slope_s                        -0.969    0.628 -1.54347  0.1227
DistPark_s                     3.347    1.319  2.53709  0.0112
Nom.ReserveM                  -19.757 4843.165 -0.00408  0.9967
Nom.ReserveRW                  -10.016    3.630 -2.75878  0.0058
 
Detection:
 Estimate    SE     z  P(>|z|)
    -2.49 0.251 -9.92 3.47e-23
 
AIC: 292.7279 

 

I would like to use this model to make a prediction map but before doing that I tried to predict my categorical variables when the other variables are kept constant. These are my predicted values for reserve:

> newData=data.frame(Nom.Reserve=c("M","R","RW"),DistPark_s=rep(0,3),Dist.Water.Main.Secon_s=c(0,0,0),Altitude_s=c(0,0,0),Dist.Main.Secondary.Roads_s=rep(0,3),Pente_s=rep(0,3))
> pred=predict(best3,type="state",newdata=newData,appendData=T)
> pred
 

Predicted

SE

lower

upper

Nom.Reserve

Distpark_s

Dist.Water.Main.Secon_s

4.24E-01

2.73E-01

7.60E-02

0.86777228

M

0

0

1.48E-07

8.23E-05

0.00E+00

1

R

0

0

5.99E-05

1.43E-04

5.64E-07

0.00632038

RW

0

0

 

In this prediction, reserve M has the highest occupancy eventhough we saw 0 elephants there and the other 2 reserves have a close to 0 occupancy. This has to be wrong!  Even if I model reserve alone, I get similar results.


Does this mean that it is impossible to predict scores when we have too few observations in our categories?


Why the model tells me that occupancy decreases (negative estimate) when we pass from reserve R (reference category) to RW and then it predicts the opposite?? I have observed the same phenomenon for other species. Example with Panthera Pardus:

> best3=occu(~Camera.type~Dist.Water.Main.Secon_s+Nom.Reserve,Pardus)
> best3
 
Call:
occu(formula = ~Camera.type ~ Dist.Water.Main.Secon_s + Nom.Reserve, 
    data = Pardus)
 
Occupancy:
                        Estimate    SE     z P(>|z|)
(Intercept)               -2.065 0.644 -3.21 0.00134
Dist.Water.Main.Secon_s   -0.883 0.347 -2.54 0.01101
Nom.ReserveM               2.031 0.878  2.31 0.02068
Nom.ReserveR               1.264 0.768  1.64 0.10007
 
Detection:
                     Estimate    SE    z  P(>|z|)
(Intercept)            -3.896 0.410 -9.5 2.09e-21
Camera.typeCuddeback   -0.705 0.392 -1.8 7.20e-02
 
AIC: 453.7909 

 

 

> newData=data.frame(Nom.Reserve=c("M","R","RW"),Dist.Water.Main.Secon_s=c(0,0,0))
> pred=predict(best3,type="state",newdata=newData,appendData=T)
> pred

Predicted

SE

lower

upper

Nom.Reserve

Dist.Water.Main.Secon_s

0.1125661

0.0643181

0.03466546

0.3094148

M

0

0.4916469

0.1997519

0.16799854

0.8224519

R

0

0.3097557

0.1496749

0.10217138

0.6389486

RW

0

 

For this case I would have assumed that Reserve M had a higher occupancy than reference reserve R. When I use predict, it looks like the opposite. This reasoning works for other species.


I only get these confusing results for species that were observed less than 50 times across 440 sites (21 days survey). Also, I only have problems with categorical predictors (habitat too).


What should I do?


Thanks in advance!


Gabriel Valdez

Kery Marc

unread,
Jul 18, 2017, 12:06:32 PM7/18/17
to unma...@googlegroups.com
Dear Gabriel,

prediction in regression models can be tricky, especialy when there are correlations among the explanatory variables or when not all combinations of covariate values are observed in nature. For instance, it is easy to obtain predictions that do not make sense, even though they are mathematically correct. But the problem is that they may represent an extrapolation into some part of the covariate space that does not exist in practice.

In your model, you predict for the three reserves and keep the values of all the continuous covariates at the mean observed value 0 (assuming that you normalized these covariates). It is easy then that you get predictions for situations that do not exist in practice, because the values of these continuous covariates will NOT be at the observed mean the three reserves. What you do now is predict for a site that lies in each of these three reserves AND has average values of all the continuous covariates in the model. But presumably that combination of covariate values does not exist anywhere in your study area.

If you want to produce a map, then you must supply in your new data frame the values of each covariate for every pixel in the landscape for which you want to predict.

Best regards  --- Marc




From: unma...@googlegroups.com [unma...@googlegroups.com] on behalf of Gabriel Valdez [gabrie...@hotmail.com]
Sent: 18 July 2017 17:46
To: unmarked
Subject: [unmarked] Wrong prediction output? Is it possible to predict occupancy when a predictor's category has 0 observations?

--
You received this message because you are subscribed to the Google Groups "unmarked" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gabriel Valdez

unread,
Jul 19, 2017, 5:05:51 AM7/19/17
to unmarked

Dear Marc,

Thank you for your fast reply once again.

I did as you said and used my model to predict occupancy across every pixel of my study area (using ArcGIS). However, the problem persists.

 

> best2=occu(~1~Altitude_s+Habitat+Dist.Water.Main.Secon_s+Dist.Main.Secondary.Roads_s+Slope_s+Nom.Reserve,Eleph)

> best2
 
Call:
occu(formula = ~1 ~ Altitude_s + Habitat + Dist.Water.Main.Secon_s + 
    Dist.Main.Secondary.Roads_s + Slope_s + Nom.Reserve, data = Eleph)
 
Occupancy:
                            Estimate      SE       z P(>|z|)
(Intercept)                   -3.545   1.088 -3.2576 0.00112
Altitude_s                     0.857   0.556  1.5400 0.12357
Habitatclosed woodland         3.997   1.609  2.4845 0.01297
Habitatopen grassland          2.149   1.304  1.6483 0.09929
Habitatwooded grassland      -10.073 294.202 -0.0342 0.97269
Dist.Water.Main.Secon_s       -0.561   0.497 -1.1287 0.25901
Dist.Main.Secondary.Roads_s    2.427   0.798  3.0437 0.00234
Slope_s                       -0.843   0.469 -1.7979 0.07219
Nom.ReserveM                  -13.494 392.619 -0.0344 0.97258
Nom.ReserveRW                 -3.308   1.731 -1.9110 0.05601
 
Detection:
 Estimate    SE     z  P(>|z|)
     -2.5 0.253 -9.88 4.93e-23
 
AIC: 300.0922 

 

> Eleph.Predmap=predict(best2, type="state", newdata=Predmap, appendData=T)

 

> aggregate(Eleph.Predmap[,1:4], list(Eleph.Predmap$Nom.Reserve), mean)
  Group.1    Predicted           SE       lower     upper
1   M       8.329646e-02 0.0913347325 0.007149494 0.4972339
2   R       2.340644e-06 0.0009189598 0.000000000 1.0000000
3  RW       6.753009e-02 0.0592211875 0.007940877 0.2765730

 

I still get a lower occupancy prediction in the reserve where I observed most elephants.

I controlled for multicollinearity using VIF and it seems to me to be okey (no VIF higher than 5).

 

> vif(lm(Model_ss$Eleph~Model_ss$Altitude_s+Model_ss$Habitat+Model_ss$Dist.Water.Main_s+Model_ss$Dist.Main.Secondary.Roads_s+Model_ss$Slope_s+Model_ss$Nom.Reserve))
                                         GVIF Df GVIF^(1/(2*Df))
Model_ss$Altitude_s                  2.980656  1        1.726458
Model_ss$Habitat                     1.259629  3        1.039219
Model_ss$Dist.Water.Main_s           1.085870  1        1.042051
Model_ss$Dist.Main.Secondary.Roads_s 1.714767  1        1.309491
Model_ss$Slope_s                     1.644706  1        1.282461
Model_ss$Nom.Reserve                 4.757057  2        1.476844

 

 

Also, in the model we can see that parameter ReserveRW has a negative estimate. Shouldn’t this mean that reserve RW has a lower “occupancy” than R (reference category)?


I really don't know if I should throw my model to the garbage or not.

 

Thanks again!

 

Gabriel

Gabriel Valdez

unread,
Jul 19, 2017, 5:09:57 AM7/19/17
to unmarked
This is what my dataset containing all covariates by pixel looks like:

> head(Predmap)
  FID Join_Count       Habitat            Water_FC Dist_water Dist_Villa                  Villa_FC Dist_culti
1   0          1 open woodland Water\\Rivers major   11322.00   23331.37 Infrastructures\\Villages   16042.56
2   1          1 open woodland Water\\Rivers major   10975.45   23324.18 Infrastructures\\Villages   16294.47
3   2          1 open woodland Water\\Rivers major   11346.64   23824.22 Infrastructures\\Villages   16725.02
4   3          1 open woodland Water\\Rivers major   11727.41   24324.27 Infrastructures\\Villages   17159.34
5   4          1 open woodland Water\\Rivers major   12116.86   24824.31 Infrastructures\\Villages   17597.15
6   5          1 open woodland Water\\Rivers major   12514.17   25324.36 Infrastructures\\Villages   18038.20
  Dist_Park Dist_GameR Dist_road Dist_escar     Slope  Altitude Nom.Reserve DistWatMS            WatMS_Fc
1   33390.10   34129.73  791.6553   8708.317 0.7220203 1291.973       M      11322.00 Water\\Rivers major
2   32890.93   33635.16 1268.3798   8869.866 0.8075468 1288.703       M      10975.45 Water\\Rivers major
3   32864.96   33564.47 1107.7780   8402.630 0.5225660 1290.012       M      11346.64 Water\\Rivers major
4   32846.59   33501.09  943.6142   7939.391 1.1464645 1294.530       M      11727.41 Water\\Rivers major
5   32835.82   33445.07  779.4499   7480.893 1.3533006 1302.229       M      12116.86 Water\\Rivers major
6   32832.67   33396.44  615.2854   7028.064 1.3432235 1310.627       M      12514.17 Water\\Rivers major
  Dist.Water.Main_s DistVillage_s Dist.cultivation_s  DistPark_s Dist.Game.Reserve_s Dist.Main.Secondary.Roads_s
1         0.8522358   -0.27133053         -0.6373598   0.06679590            2.130527                   -1.986802
2         0.8016014   -0.27203736         -0.6094975   0.04647859            2.106300                   -1.815373
3         0.8558062   -0.22313092         -0.5623724   0.04541753            2.102822                   -1.868811
4         0.9104960   -0.17473497         -0.5154452   0.04466649            2.099702                   -1.927613
5         0.9655208   -0.12683387         -0.4687381   0.04422625            2.096941                   -1.991794
6         1.0207530   -0.07941273         -0.4222696   0.04409727            2.094542                   -2.063174
  Dist.Escarpement_s     Slope_s Altitude_s Dist.Water.Main.Secon_s
1          0.7735670 -0.15394253  0.3302090                1.671498
2          0.7980627 -0.08783263  0.3092900                1.612651
3          0.7265855 -0.32534568  0.3176652                1.675647
4          0.6537298  0.14474215  0.3465511                1.739207
5          0.5794949  0.26986292  0.3956533                1.803156
6          0.5039094  0.26399864  0.4490529                1.867346
<w

Kery Marc

unread,
Jul 19, 2017, 6:34:23 AM7/19/17
to unmarked

Dear Gabriel,

 

I am a little puzzled, but from looking at your pardus results reported earlier (and somewhere at the bottom of this mail), I get this hunch: I think that in R, the levels of a factor must appear in exactly the same order in the factor that you use to fit the model and in the predict statement. Look at your pardus results again:

 

> best3=occu(~Camera.type~Dist.Water.Main.Secon_s+Nom.Reserve,Pardus)
> best3
 
Call:
occu(formula = ~Camera.type ~ Dist.Water.Main.Secon_s + Nom.Reserve, 
    data = Pardus)
 
Occupancy:
                        Estimate    SE     z P(>|z|)
(Intercept)               -2.065 0.644 -3.21 0.00134
Dist.Water.Main.Secon_s   -0.883 0.347 -2.54 0.01101
Nom.ReserveM               2.031 0.878  2.31 0.02068
Nom.ReserveR               1.264 0.768  1.64 0.10007
 
[.....]

You have different orders of the factor levels. In the fitted model (i.e., in the factor ‘Nom.Reserve’ used to fit the model), the intercept and hence the first (reference) level seems to be RW and then the 2nd and 3rd levels are M and R. In contrast, the ordering of the factor levels in ‘Nom.Reserve’ in the predict command is M, R, RW. It seems that predict produces predictions of the reserved IN THE SAME order as in the fitted model, i.e., first comes RW, then M and finally R. Look at this:

 

> plogis(-2.065)

[1] 0.1125455

 

This is essentially identical for the prediction for the first level (RW), which, however, there is erroneously labelled M.

 

Next, the second prediction is that for the 2nd level IN THE FITTED MODEL (i.e., M)

> plogis(-2.065 + 2.031)

[1] 0.4915008

 

And, finally, this seems to be the prediction for level R.

> plogis(-2.065 + 1.264)

[1] 0.3098116

 

Hence, I would try to list the names of the reserves in the Nom.Reserve factor in the predict command in the same order as in the factor used to fit the model and I guess things come out right.

 

Best regards  --- Marc

Gabriel Valdez

unread,
Jul 19, 2017, 8:32:49 AM7/19/17
to unmarked
Thank you Marc,

I'm really really sorry but your assumption comes from one of my dumb mistakes. I manually abbreviated the reserve names of the output and forgot to add a W to the second reserve. R is still the reference.

This is the correct output:


Call:
occu(formula = ~Camera.type ~ Dist.Water.Main.Secon_s + Nom.Reserve, 
    data = Pardus)
 
Occupancy:
                        Estimate    SE     z P(>|z|)
(Intercept)               -2.065 0.644 -3.21 0.00134
Dist.Water.Main.Secon_s   -0.883 0.347 -2.54 0.01101
Nom.ReserveM               2.031 0.878  2.31 0.02068
Nom.ReserveRW              1.264 0.768  1.64 0.10007


and therefore the prediction is

  Predicted        SE      lower     upper Nom.Reserve Dist.Water.Main.Secon_s
1 0.4916469 0.1997519 0.16799854 0.8224519       R                           0
2 0.1125661 0.0643181 0.03466546 0.3094148       M                           0
3 0.3097557 0.1496749 0.10217138 0.6389486      RW                           0

Even if I change the order in the predict command it won't change the result.

Very sorry for this.

best regards --Gabriel

Kery Marc

unread,
Jul 19, 2017, 8:39:16 AM7/19/17
to unmarked

OK, that beats me then. But I think the fact that predict seems to give the right predictions in the wrong order seems significant. Anybody else any ideas ?

Chris Merkord

unread,
Jul 19, 2017, 9:40:41 AM7/19/17
to unmarked
Just to be clear... It doesn't matter what order the rows are in in your new data frame. It matters what order your factor levels are in, i.e levels(new_data$Nom.Reserve) should equal levels(original_data$Nom.Reserve). Sorry I don't remember your data frame names.

Cheers,
Chris

Gabriel Valdez

unread,
Jul 19, 2017, 10:11:17 AM7/19/17
to unmarked
It worked!!!!!! Thanks a million times!

What i needed to do was to relevel the newData$Nom.Reserve just as i did in the original data before doing the prediction.

newData$Nom.Reserve=relevel(newData$Nom.Reserve,ref="Rukwa")


Marc was right, it was the right prediction but in the wrong order.

I've being struggling with these weird results for days.

You both are my heros, thank you very much.

Jim Baldwin

unread,
Jul 19, 2017, 11:04:29 AM7/19/17
to unma...@googlegroups.com
I just wanted to follow up on your comment "Turns out I am not supposed to interpret parameters with huge SE."  That's an incorrect interpretation of what Marc Kery told you.  He told you exactly what what the issue is.

The maximum likelihood estimator for that probability is 0 (which translates to minus infinity in the logit scale) and is on the border of the parameter space.  (Same thing when the probability is estimated to be 1).  In such cases one cannot (using maximum likelihood) estimate the associated estimate of precision.  That's just a fact of life with maximum likelihood estimation.

Because unmarked uses iterative methods to find the maximum likelihood estimates, the estimation procedure stops when the function being optimized stops changing by a preset amount.  (It's a little more complicated than that but close.)  For your data it stopped when the estimate for ReserveM was -17.46 which (as Marc pointed out) corresponds essentially to a probability of zero just as it would be if minus infinity was given as the estimate on the logit scale.

So (to finally make the point) it is the parameter estimate being very negative is what is important and not the size of the reported standard error.

OK.  End of sermon.

Jim

--
You received this message because you are subscribed to the Google Groups "unmarked" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+unsubscribe@googlegroups.com.

Kery Marc

unread,
Jul 19, 2017, 11:14:53 AM7/19/17
to unma...@googlegroups.com

Thanks, Jim, for the clarifications !

 

Best regards  --- Marc

 

 

Von: unma...@googlegroups.com [mailto:unma...@googlegroups.com] Im Auftrag von Jim Baldwin
Gesendet: Mittwoch, 19. Juli 2017 17:04
An: unma...@googlegroups.com
Betreff: Re: [unmarked] Wrong prediction output? Is it possible to predict occupancy when a predictor's category has 0 observations?

 

I just wanted to follow up on your comment "Turns out I am not supposed to interpret parameters with huge SE."  That's an incorrect interpretation of what Marc Kery told you.  He told you exactly what what the issue is.

--

To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+u...@googlegroups.com.


For more options, visit https://groups.google.com/d/optout.

 

--

You received this message because you are subscribed to the Google Groups "unmarked" group.

To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+u...@googlegroups.com.

Gabriel Valdez

unread,
Jul 20, 2017, 2:50:48 AM7/20/17
to unmarked
That helps me a lot. You were right to clarify this.
 Thanks a lot.
Gabriel
Reply all
Reply to author
Forward
0 new messages