Variable reduction using Maxent?


Mark

Mar 18, 2009, 6:48:54 PM
to Maxent
I am curious about how others may be approaching variable reduction in
Maxent distribution modeling. Obviously, our first decisions about
which variables to consider including are based on the life history
of the species to be modeled. In the past, we have used PCA as a
next step to determine which of the variables that make the
"ecologically significant" cut might explain the most variability in
the environment within our study areas. This simplified approach
allowed us to whittle down dozens of predictors to just a handful, and
is feasible when dealing with a large number of species where a
species-by-species, univariate approach to variable reduction would be
impractical. However, I have done a few tests that have shown that
sometimes, or perhaps most of the time, the variables that load
highest on the first few principal components may not be the variables
that do the best job of predicting distribution using Maxent.
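
To make that screening step concrete, here is a minimal sketch in
Python (scikit-learn); the randomly generated table stands in for real
predictor values sampled at background points, and the column names
are hypothetical:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for predictors sampled at background points; in practice
# this table would be extracted from the GIS layers.
rng = np.random.default_rng(0)
cols = ["bio1", "bio12", "slope", "aspect", "soil", "tri"]
env = pd.DataFrame(rng.normal(size=(500, len(cols))), columns=cols)

X = StandardScaler().fit_transform(env)   # standardize before PCA
pca = PCA(n_components=3).fit(X)
loadings = pd.DataFrame(pca.components_.T, index=env.columns,
                        columns=["PC1", "PC2", "PC3"])

# Screen by keeping the predictor that loads highest (in absolute
# value) on each of the leading components.
for pc in loadings.columns:
    print(pc, loadings[pc].abs().idxmax())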

As a substitute for variable reduction via PCA, I am now considering
whether the variable importance measures in Maxent could be used to
help select a subset of modeling layers. Specifically, my thought was
to run a Maxent model with all possible predictors (50+), and then
select which of the variables will go into the final model--the one
used to generate the map that will be the final product--based on some
interpretation of the order and values in the variable importance
tables. This brings up a number of questions:

1) It seems that picking the top 8 (for example) predictors based on
variable importance does not guarantee that we have selected the best
8-variable model...? In other words, selecting a model in this way
would be generally equivalent to a stepwise, rather than "best
subsets" selection method. Is this correct?

2) If we were to select variables based on their ranking and/or values
in the variable importance table, how do we decide how many variables
to include? I can think of at least three ways: (a) by setting a
threshold for how many of the top variables to use for all species
(i.e., use the top 8 variables for each species); (b) by setting a
threshold for the percent contribution (i.e., use all variables on
the list down to the point where the percent contribution falls
below 1.0); and (c) by setting a threshold for cumulative percent
contribution (i.e., going down the list of variables until their
percent contributions sum to at least 95%). Each of these
possibilities has some readily apparent
drawbacks.
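
For concreteness, here is a minimal sketch of the three rules in
Python, assuming the percent contributions have already been pulled
out of Maxent's results into a dict; all variable names and values
are invented:

# Hypothetical percent contributions for one species.
contrib = {"bio1": 38.2, "bio12": 21.5, "slope": 12.0, "bio7": 9.8,
           "soil": 7.1, "aspect": 4.9, "bio15": 3.3, "tri": 1.7,
           "bio3": 0.9, "landcover": 0.6}

ranked = sorted(contrib, key=contrib.get, reverse=True)

# Rule (a): a fixed number of top variables for every species.
top_k = ranked[:8]

# Rule (b): every variable contributing at least 1.0%.
above_min = [v for v in ranked if contrib[v] >= 1.0]

# Rule (c): walk down the list until the cumulative contribution
# reaches 95%.
cumulative, total = [], 0.0
for v in ranked:
    cumulative.append(v)
    total += contrib[v]
    if total >= 95.0:
        break

print(top_k, above_min, cumulative, sep="\n")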

3) Finally, I was wondering whether it's possible to calculate a
metric analogous to AIC, or another information criterion, to
determine how many variables should be included in the final models
for each species. Is this possible, short of having to run all
possible models with 1-k variables and comparing model significance
(or some other metric) for each? A much more general approach that I
have tried thus far is plotting the AUC vs. number of variables for a
handful of species, to determine where the AUC may asymptote, but it
seems that there should be an easier and more objective way to do
this.
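
That asymptote check can be sketched as follows; run_maxent() is a
hypothetical wrapper around one's own Maxent invocation, faked here
with a saturating curve purely so the example executes end to end:

import matplotlib.pyplot as plt

def run_maxent(variables):
    # Placeholder: substitute a real call that fits Maxent on
    # `variables` and returns a test AUC; the fake curve below exists
    # only so the sketch runs.
    return 0.95 - 0.25 / len(variables)

ranked_vars = ["bio1", "bio12", "slope", "bio7",
               "soil", "aspect", "bio15", "tri"]  # ordered by importance

ks = list(range(1, len(ranked_vars) + 1))
aucs = [run_maxent(ranked_vars[:k]) for k in ks]

plt.plot(ks, aucs, marker="o")
plt.xlabel("number of variables")
plt.ylabel("test AUC")
plt.show()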

I appreciate any input or comments anyone may have on these questions.

-Mark

B. Miller

Mar 18, 2009, 7:09:06 PM
to Max...@googlegroups.com
Mark

As an FYI, I am running some 80 species using the same climate and
environmental variable sets, and that consistency is important for
my work. I am looking at bats, and the responses are considerably
different from one group, or even one species, to another.

E.g., species that are tightly linked to cave sites (obligate
cave-roosting species) show that distance from cave entrances (a
continuous variable) is very important, while for other species that
do not necessarily use caves this variable is virtually ignored by
MaxEnt. Elevation is very important for some species and not at all
for others.

So a priori "reducing" the environmental variables would not give me
the best answers. It costs nothing more than some computational time
to run all layers for all species.

When I get to the larger landscape-level analyses (the above is only
one country), for all of Mesoamerica, I will have >220 species and
perhaps an additional 5 environmental variable layers. There is no
telling a priori which will prove to be important until I run all of
them.

Bruce

Martin Damus

Mar 18, 2009, 7:28:11 PM
to Max...@googlegroups.com

Hi,

I usually look at the variable set that I have to "play with" and select those that, based on the organism's biology, would most likely influence its survival. For instance, I normally model insects, for which average temperature during the growing period, average temperature in the non-growing period, temperature seasonality, total days of frost, and rainfall during the growing season are the most important predictors. For my purposes (risk assessment) I also want to err on the side of caution, and I find that including too many variables results in such tight overfitting that projections into novel areas are probably far too restricted.

Martin


andrew yost

Mar 18, 2009, 7:30:22 PM
to Max...@googlegroups.com
Mark,

Attached is an example of how I chose the best set of predictors. The last paragraph on page 380 explains the approach, and Fig. 4 illustrates it. Hope that helps.

AYost
maxentSageGrouseEcoInf.pdf

Kendal Young

Mar 19, 2009, 12:48:40 AM
to Max...@googlegroups.com

This is a great question that has plagued me for a long time. After trying various methods (including building models using all possible combinations of environmental variables), I have settled on this approach (at least for now):

 

I first create a model that includes all variables (the global or full model). This model has the most flexibility in fitting the data but may have low precision. I then exclude all variables that contribute < 3% to the full model and recreate the spatial model. Next, I test the resulting model for variable correlation. Any correlated variables are removed from the final model by retaining the variable with the highest model contribution. For my work, this approach seems to provide the most parsimonious model possible, i.e., a model that strikes a balance between the extremes of having too few parameters (under-fitting) and too many parameters (over-fitting).
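
A minimal sketch of those two steps in Python, under stated
assumptions: contrib holds the percent contributions from the full
model, env is a pandas DataFrame of predictor values sampled at
points, and the 0.7 correlation cutoff is a placeholder rather than a
value from this thread:

import pandas as pd

def select_variables(contrib, env, min_pct=3.0, r_max=0.7):
    # Step 1: drop every variable contributing < min_pct to the full model.
    keep = [v for v in contrib if contrib[v] >= min_pct]
    # Step 2: among correlated variables, retain the one with the higher
    # model contribution (visit variables in order of contribution).
    corr = env[keep].corr().abs()
    selected = []
    for v in sorted(keep, key=contrib.get, reverse=True):
        if all(corr.loc[v, s] <= r_max for s in selected):
            selected.append(v)
    return selected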

 

I too investigated using Akaike's Information Criterion (AIC). I ran through the exercise, and the results seemed favorable. However, I was unsure whether Maxent is compatible with the assumptions behind AIC, and since I have not received any answers on that, I have chosen not to use AIC at this time. I would be interested in learning whether others have.

 

Kendal


Tereza

Mar 19, 2009, 4:02:48 PM
to Maxent
Kendal,


I like your approach. Have you published anything using this
methodology for variable selection that I could cite in my paper?

Thanks, Tereza

Marnin Wolfe

Mar 19, 2009, 4:07:39 PM
to Max...@googlegroups.com
With MaxEnt, why would you need to use a model selection method like the ones described above? The regularization step weights the variables you input based on their importance, which is meant to avoid overfitting.

Perhaps I am too new to this method, but would someone mind explaining why it is necessary?
--
Marnin Wolfe
University of Pittsburgh
Department of Biological Sciences
Ecology & Evolution Program
wol...@gmail.com (or)
md...@pitt.edu
239-595-5081

Kendal Young

Mar 19, 2009, 8:59:17 PM
to Max...@googlegroups.com
Tereza,

I'm working on a manuscript, but have not yet put my methodology out
for review. I arrived at the approach by trying a lot of different
ways; I'll let you know when I have something under review. I am
using this approach to model invasive plants.

Kendal

_________________________________________
Kendal Young
New Mexico Coop. Fish and Wildlife Research Unit
Department of Fish, Wildlife and Conservation Ecology
Department of Animal and Range Science
Box 30003, MSC 4901
New Mexico State University
Las Cruces, NM 88003
(209)795-4406 - Current

Mark

Mar 31, 2009, 4:30:26 PM
to Maxent
Thanks, all, for the input.

The reason I would like some sort of variable reduction is precisely
the computation time required when leaving them all in. I have a
regional project where I have about 50 variables, times 500+ species,
times a study area that includes 1.5 BILLION pixels. Running and
projecting a model for 1 species across this study area using only 7
variables takes anywhere from 3 to 8 hours. Running a model with 50+
variables, if this is even possible, would take days for each
species. Note that the time required to run the models is much less
if I am only generating models, not projecting them. That's why I was
considering using a Maxent "first pass" to identify the most important
variables using a run with no projection, then running and projecting
a model of some smaller subset of variables for the "final" model.

Kendal--your approach is interesting, and sounds promising, but I
wonder how consistent it is, given that the percent contributions will
go up or down depending on how many variables you have included. For
example, in a run with 8 variables, all may have > 3% contribution,
whereas if you stuck in 30 variables in addition to those 8, you might
find that many of those first 8 have a smaller percent contribution,
since it's taken as a percent of the total contribution of all
variables. In other words, the more variables you throw into the
model, the smaller those percentages generally get (see the toy
numbers after this paragraph). Second, what
threshold do you use for variable collinearity above which you discard
the variable with the lowest percent contribution? Finally, if you
have any sort of paper, report, or other summary on your work with
AIC, I would be very interested to see that. I don't have the
statistical/mathematical horsepower to comment on whether a typical
Maxent run meets the assumptions for using AIC--maybe Steven could
chime in here...?
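
To put toy numbers on that renormalization point (all gain values
below are invented):

# Eight variables with equal gains: each contributes 100*10/80 = 12.5%.
gains_8 = {f"v{i}": 10.0 for i in range(1, 9)}
pct_8 = {v: 100 * g / sum(gains_8.values()) for v, g in gains_8.items()}

# Add 30 weaker variables; the original gains are unchanged, but each
# of the first eight now contributes only 100*10/140, or about 7.1%.
gains_38 = {**gains_8, **{f"x{i}": 2.0 for i in range(1, 31)}}
pct_38 = {v: 100 * g / sum(gains_38.values()) for v, g in gains_38.items()}
print(pct_8["v1"], pct_38["v1"])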

Andrew--your approach seems very logical, but I'm wondering whether
you could provide some additional details. On page 381, you say,
"Using the overlap between 95% confidence intervals for training gain
averages as the criteria for significance the five-variable model
containing the spatial coordinates, vegetation class, elevation and
aspect was not significantly different than the two larger models but
was different than the remaining smaller models (Fig. 4)." I'm
wondering what test you used to determine whether the differences in
training gain averages were significant between models of different
sizes. It seems to me this is close to an information criterion-type
approach, but I'd like to understand the details of the data and test.

Again, thanks to all for the suggestions.

-Mark

yasmin...@students.mq.edu.au

Aug 11, 2015, 9:27:47 PM
to Maxent, Max...@googlegroups.com

Hi there,
If you work with GIS, you can download the SDMtoolbox and use it within your GIS software. It covers most of the important steps of preparing and analysing data before and after running Maxent, and it includes a tool that removes highly correlated variables (Explore Climate Data: Remove Highly Correlated Variables).

cheers