Variable reduction - overfitting?


Chloe

Jun 8, 2011, 7:24:46 AM
to Maxent
Hello,

I was hoping someone could help me interpret my results. I have built
MaxEnt models in an attempt to deal with sampling bias (by including a
bias file describing my sampling effort) and high levels of residual
spatial autocorrelation (by spatially constraining test and training
data and increasing regularisation). I use 5 fold cross-validation,
but ensure that test data are always separated by a threshold
distance. Initial models had poor AUC values, so I started
experimenting with variable reduction, using a method similar to
Parolo et al. (2008; "Toward improved species niche modelling: Arnica
montana in the Alps as a case study"). As I pruned variables which
decreased overall model TEST AUC in a backward stepwise manner,
training AUC decreased by a small amount and test AUC increased by a
large amount (e.g. 0.4 to 0.7) - as you might expect. My final optimal
model set (highest test AUC values) often consists of only a small
number of variables (2 - 4). My models are therefore simple, but they
perform much better than full models in my study region and a new test
region. I am now having difficulty trying to explain why some
variables which appear to have high test AUC when used alone
(jackknife test AUC) decrease the overall model performance. I did
not include any variable pairs that are highly collinear (r>0.7).
Could it be to do with the fact my data points are clustered, so that
the model overfits to certain variables?

Thank you for any help/comments.

Chloe
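
For readers who want to try the pruning described above, here is a minimal Python sketch of backward stepwise elimination driven by a score. It is not the exact procedure of Parolo et al. (2008) or of the post above. The function `toy_test_auc` and its variable names are hypothetical stand-ins: in practice the scoring function would run the spatially separated cross-validation and return the mean test AUC for a given variable set.

```python
def backward_eliminate(variables, score_fn, min_vars=1):
    """Drop one variable at a time for as long as doing so raises the score."""
    current = list(variables)
    best = score_fn(current)
    history = [(tuple(current), best)]
    while len(current) > min_vars:
        # Score every candidate set obtained by removing a single variable.
        trials = [
            (score_fn([v for v in current if v != drop]), drop)
            for drop in current
        ]
        trial_score, drop = max(trials)
        if trial_score <= best:
            break  # no single removal improves the score, so stop
        current.remove(drop)
        best = trial_score
        history.append((tuple(current), best))
    return current, best, history


def toy_test_auc(variables):
    """Hypothetical stand-in for 'mean test AUC from cross-validation'."""
    useful = {"temp": 0.10, "precip": 0.08}        # assumed informative
    harmful = {"noise_a": 0.06, "noise_b": 0.04}   # assumed to hurt transfer
    score = 0.5
    score += sum(useful.get(v, 0.0) for v in variables)
    score -= sum(harmful.get(v, 0.0) for v in variables)
    return round(score, 3)


selected, best, history = backward_eliminate(
    ["temp", "precip", "noise_a", "noise_b"], toy_test_auc
)
for variable_set, score in history:
    print(score, variable_set)
```

Because the variables are chosen by looking at test AUC, the cross-validation folds are no longer a fully independent check of the final model. A separate test region, like the one mentioned in the post, is what provides that.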

Lukas Rinnhofer

Jun 8, 2011, 7:41:01 AM
to max...@googlegroups.com
Hi Chloe,

How many sample points are you using?
You should always be aware of the ecological importance of the variables for your species. Don't just use any variables available. Try to figure out which variables have the highest impact on your species and then do your variable selection.

As you already mentioned, if your data points are very clustered this could influence the result too.

Hope this helps,

Lukas



2011/6/8 Chloe <bellam...@googlemail.com>

--
You received this message because you are subscribed to the Google Groups "Maxent" group.
To post to this group, send email to max...@googlegroups.com.
To unsubscribe from this group, send email to maxent+un...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/maxent?hl=en.


Chloe

Jun 8, 2011, 8:27:28 AM
to Maxent
Thanks Lukas,

My models have around 40 to 600 non-duplicate points depending on the
species. However, each species does better with a reduced variable
set, regardless of sample size. My variable selection is somewhat
exploratory, but all have some biological reason for inclusion. I'm
just puzzled as to why a variable can perform well on its own, but not
so well when included with other variables - could it be to do with
the complex interactions that are fitted??

Thanks again,

Chloe

David Galbraith

Jun 8, 2011, 8:57:25 AM
to max...@googlegroups.com
Chloe,

I think you're on to something. Check out the lambdas output files; they'll give you a better feel for model complexity and for how your explanatory variables are used numerically in the final models. Even if you're only giving Maxent a few variables to play with, if your sample sizes are large enough (I think all feature types are used with more than 80 samples), Maxent will use all of its feature types (product, hinge, threshold, categorical etc.) and can produce some pretty complex models for the sample size.

Dave
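
As a rough illustration of what "checking the lambdas file" can look like, here is a small Python sketch that counts the features with non-zero weights, grouped by feature type. The file layout is an assumption on my part: four comma-separated fields per feature (name, lambda, min, max) followed by two-field constants. The naming rules in `feature_type` are also assumed rather than taken from the Maxent documentation, and the sample text is invented, so check both against your own output before relying on the counts.

```python
from collections import Counter

# Invented sample in the ASSUMED lambdas layout: "name, lambda, min, max".
SAMPLE_LAMBDAS = """\
bio1, 1.25, 2.0, 28.5
bio12, 0.0, 100.0, 3200.0
bio1^2, -0.8, 4.0, 812.25
bio1*bio12, 0.3, 200.0, 91200.0
(150.5<bio12), 0.6, 0.0, 1.0
'bio1, -0.4, 15.0, 28.5
linearPredictorNormalizer, 2.1
densityNormalizer, 140.7
numBackgroundPoints, 10000
entropy, 8.3
"""


def feature_type(name):
    """Guess the feature type from its name (naming rules are assumptions)."""
    if name.startswith(("'", "`")):
        return "hinge"
    if name.startswith("(") and "<" in name:
        return "threshold"
    if name.startswith("(") and "=" in name:
        return "categorical"
    if "*" in name:
        return "product"
    if name.endswith("^2"):
        return "quadratic"
    return "linear"


def count_active_features(text):
    """Count features whose lambda is non-zero, grouped by guessed type."""
    counts = Counter()
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue  # skip blank lines and the two-field constants
        name, lam = fields[0], float(fields[1])
        if lam != 0.0:
            counts[feature_type(name)] += 1
    return dict(counts)


active = count_active_features(SAMPLE_LAMBDAS)
print(active, "total:", sum(active.values()))
```

Comparing the total for the full 16-variable model with the total for a pruned model gives a quick, if crude, feel for how much complexity the extra variables add.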


Chloe

Jun 8, 2011, 10:00:35 AM
to Maxent
Hmm... I guess that must be it. The number of parameters produced by
my full set of 16 variables is huge. The models may be overfitting to
particular conditions within each data cluster which reduces overall
model transferability. I hope that this means that the strength and
direction of the relationships I've found between the species and
these variables are still valid - it is just that the interactions
produced when combined with other variables decrease model
performance in other areas. Which makes me wonder whether I should
play around with the feature selection? Or just stick with my
parsimonious set of variables that seem to perform well in new areas?
More to think about!

Thanks for your help, Dave. Any more suggestions/thoughts from anyone
would be much appreciated!

Bruce Miller

Jun 8, 2011, 10:16:37 AM
to max...@googlegroups.com
As always, keep in mind:

NEVER confuse statistical significance with ecological significance.

Do the distribution models make ecological sense, even if "overfitted",
for your critters?
Loads of variables like rainfall are correlated with altitude or
distance from coastlines etc.
But it is what it is, and the critters evolved in these "correlated" systems.


--
Bruce W. Miller, Ph.D.
Conservation Ecologist
Neotropical Bat Project


office details
Gallon Jug, Belize
Mailing address
P.O. Box 37, Belize City
Belize, Central America
Phone +501-220-9002


Antonio Trabucco

Jun 8, 2011, 10:17:38 AM
to max...@googlegroups.com
Well,
in my opinion, if the sample distribution is biased, the use of the
threshold feature may also contribute to overfitting.

Antonio Trabucco
Forest Ecology and Management
Division Forest, Nature and Landscape
K.U.Leuven


David Galbraith

Jun 8, 2011, 3:09:28 PM
to max...@googlegroups.com
You might play with the settings for the regularization parameters as well. I don't quite get the meat of how they function, but I think they're discussed most clearly in the Phillips and Dudík 2008 Ecography paper ('Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation').
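
A rough sketch of what the regularization does, as I understand that paper (this is a paraphrase, so check the notation against the original): Maxent chooses the feature weights \lambda_j to maximise a penalised log-likelihood over the m presence records x_1, ..., x_m, where each weight pays an L1 penalty scaled by how noisy its feature f_j is. The symbol c for the user-set regularization multiplier is my own label, not the paper's.

```latex
\max_{\lambda}\;
  \frac{1}{m}\sum_{i=1}^{m} \ln q_{\lambda}(x_i)
  \;-\; \sum_{j} \beta_j \,\lvert \lambda_j \rvert ,
\qquad
\beta_j \;=\; c \,\beta \,\sqrt{\frac{s^2[f_j]}{m}}
```

Here q_\lambda is the fitted Maxent distribution, s^2[f_j] is the variance of feature f_j over the presence records, and \beta is a constant tuned per feature class. Raising c makes every \beta_j larger, which pushes more weights to exactly zero and gives smoother response curves.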



Chloe

Jun 9, 2011, 5:26:29 AM
to Maxent
I have now played around with turning off the threshold and product
features, but I'm not happy with the maps produced. I've already
increased the regularisation multiplier to two as this seemed to
produce the most biologically intuitive, smooth response curves. The
increase in test AUC is large for some species and small for others -
it's always an increase, but I can't see a pattern. The few
other papers I've seen that carry out this variable reduction also found
an initial increase in test AUC (Yost et al. 2008; Hu & Jiang 2010;
Parolo et al. 2008) - so maybe it's something that should be tried more
often, especially if the variable set is large and potentially
experimental. It makes sense to produce the most parsimonious models
which perform as well as, or better than, a full model. So I'm sticking with
my pruned models which make biological sense, produce good maps and
perform well.

Martin Damus

Jun 9, 2011, 7:45:14 AM
to max...@googlegroups.com
Chloe and others,

I recently came across two papers that may be useful:

Bedia J, Busqué J, Gutiérrez JM. 2011. Predicting plant species distribution
across an alpine rangeland in northern Spain. A comparison of probabilistic
methods. Applied Vegetation Science 2011: 1-18.

They suggest that variable reduction is not useful in Maxent, possibly because
Maxent uses variable interactions in constructing models and these are naturally
lost or reduced when numbers of variables are reduced. Of course you still want
to select your variables with some biological reason (e.g. altitude is rarely a
useful variable because species do not normally respond to altitude alone, etc.)

The other one:

Chapman DS. 2010. Weak climatic associations among British plant distributions.
Global Ecology and Biogeography 19: 831-841.

This was interesting in that he showed that one can successfully build
apparently useful models using spurious environmental data. It is the degree of
spatial autocorrelation in the environmental variables and in the species
presence locations that seems to determine when a model works or fails. I won't
pretend to understand it totally, and maybe there are others on this listserver
who could explain the full ramifications. If Chapman is correct, then it seems
that successful application of species distribution models is going to require
some really judicious choices of environmental variables, and simply running the
full Bioclim variable set is going to lead to great models every time -- but due
to variable spatial autocorrelation, not to a functional relationship between
the environmental variables used and the distribution of the organism. Maybe
this is only a problem with the methods used? (GLMs and Random Forests)

Cheers,
Martin Damus

David Galbraith

Jun 9, 2011, 8:21:44 AM
to max...@googlegroups.com
Martin,

Good points.  I hadn't made the connection about the implications of manually removing variables when interaction-based features are at stake.

I glanced at the Chapman paper and need to give it a better read. It seems like the spatial autocorrelation beast is blamed for many misuses of SDMs, but I'm always left wondering: aren't there mechanistic reasons and histories behind why clusters of species or environmental conditions exist in space? Didn't species either evolve in or colonize places where they can exist? If climate represents a fundamental control on everything from ecosystem assemblages to the geomorphic forms that structure habitat, then its potential future effects on species distributions would be dependent on other controlling variables and potential biological interactions, no? In any case, I guess that's why folks like doing ecology stuff.

Dave

Chloe

Jun 11, 2011, 6:11:56 AM
to Maxent
Thank you for the very useful papers and comments! I've come to the
decision that a few of my variables are just poor predictors and
shouldn't be included in the full models in the first place (low
jackknife test AUC and gain). Once I run these reduced models, overall
model performance is much better. Should have just done this earlier!

Brunno Oliveira

Jun 14, 2011, 2:51:30 PM
to max...@googlegroups.com
Hi everyone!
OK Chloe, AUC analysis can be a useful way to measure model performance, but we know that it's not trivial (Jiménez-Valverde 2011; Lobo et al. 2008; Rodder 2010, and many others).
I have the same doubts as you: what is the best set of variables to use in my model?
In many papers there seems to be no pattern in this kind of choice. Some authors seem to select the variables that they think make the most ecological sense. Others use multivariate analyses such as PCA, CVA or NMDS and keep the variables best explained by the axes. Others just cite the Worldclim variables chosen without mentioning the criteria used to select them, while many use all 19 Worldclim variables.
In general, it seems that we still have not reached agreement on how to select the set of variables to use in ecological modelling, which autocorrelated variables to select and which to reject, or what the best way is to compare the performance of models with different variable sets.

--
Brunno Freire
Ecólogo
Mestrando PPG Ecologia UFRN
55 84 9981510
