Crossvalidate or Subsample?

Heather

May 17, 2010, 5:52:40 PM

to Maxent

Hi everyone,

I am running models to predict the distribution of various lemur
species in Madagascar. My datasets are relatively small (mostly <100
occurrence points per species), and I am uncertain of the distinction
between the run types crossvalidate and subsample. I have run all my
models with both types, using the same parameters (random seed, use
duplicate records...) and a 25% test percentage. The predicted areas
are very similar; however, the subsample run type is slightly more
conservative (a slightly smaller extent of predicted occurrence). The
description of the subsample run type says it uses a random 75% of
occurrence points to train the model, then tests it with the remaining
25%. The description of the crossvalidate run type says it divides the
data into "replicates" folds, and each fold is in turn used as test
data. This is somewhat confusing; I am unsure what exactly the model
is doing with this run type. The default setting is crossvalidate with
the number of replicates equal to 1. I was wondering if anyone had any
ideas on which is preferred and why, or had any suggestions for me.

If anyone can help clarify this issue it would be greatly appreciated!

Thanks,

Heather


ZuZu

Jun 1, 2010, 10:37:37 PM

to Maxent

I would suggest cross-validation: every data point is used for both
training and testing (so you extract the most from your data), all
data points are weighted evenly (leaving you less vulnerable to an
unlucky split), and the statistical properties are a little nicer
(it's easier to identify outliers and so on). 10-fold is common if you
have a reasonable number of points: you divide your data into 10
equal-sized groups (folds), then use 9 of these as training data and
the remaining one as a test set. You do the same thing 9 more times,
holding out a different group for testing each time. One split, 10
model fits. Cross-validating with 1 fold means nothing is held out,
which is the same as using all your data for training. If your
datasets are very small, of course, you simply have a problem of lack
of information, and neither technique will serve well.
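To make the difference between the two run types concrete, here is a minimal Python sketch of the splitting logic only. It is not Maxent's own code: the function names `kfold_splits` and `subsample_splits` are made up for illustration, and truncating the test-set size with `int()` is an assumption, not something taken from the Maxent documentation.

```python
import random


def kfold_splits(n_points, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices round-robin so fold sizes differ by at most 1.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training data is every fold except the one held out for testing.
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def subsample_splits(n_points, test_fraction, replicates, seed=0):
    """Yield (train, test) index lists for repeated random subsampling."""
    rng = random.Random(seed)
    # Assumption: the test-set size is truncated, e.g. 25% of 50 -> 12.
    n_test = int(n_points * test_fraction)
    for _ in range(replicates):
        idx = list(range(n_points))
        rng.shuffle(idx)  # a fresh, independent split for every replicate
        yield idx[n_test:], idx[:n_test]


if __name__ == "__main__":
    # 50 occurrence points, 10 folds: each model trains on 45 and tests on 5.
    for train, test in kfold_splits(50, 10):
        print(len(train), len(test))
    # 50 occurrence points, 25% held out, 10 independent replicates.
    for train, test in subsample_splits(50, 0.25, 10):
        print(len(train), len(test))
```

With k-fold, every point lands in a test set exactly once. With repeated subsampling, the test sets are drawn independently, so some points may be tested several times and others never.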

There are good resources on this stuff - as painful as you may find
detailed statistics, you'll find it worth your while to develop a deep
understanding of what exactly you are doing.

ZuZu

Michael Anderson

Jun 2, 2010, 8:42:28 AM

to max...@googlegroups.com

Hi Heather,

I am also working on potential distribution models in Madagascar. I'm focusing on Lepilemur in the north and northwest. I was curious which species and area your research covers. Maybe we could collaborate in some fashion if there is some overlap.

~Mike

Heather Peacock

Jun 2, 2010, 12:49:53 PM

to max...@googlegroups.com

Hi Zuzu,

Thanks for the clarification. Unfortunately I do not have very large datasets. All but one species have fewer than 100 records, and most have around 50. I am also using the bootstrap method for species with fewer than 25 records. I would like to try the crossvalidation technique in addition to the subsampling technique and see if there are any marked differences in the outputs. However, with such small datasets I am unsure how to determine the number of replicates. Is there an easy way to decide, or is it somewhat arbitrary?

Thanks!

Heather

Heather Peacock

Jun 2, 2010, 12:57:59 PM

to max...@googlegroups.com

Hi Mike,

I am doing a country-wide analysis of lemur distribution/range to assess how effective the network of Protected Areas will be at preserving lemur diversity. I have occurrence data for about 50 species (spanning the taxon, including Lepilemur, and various levels of threat/IUCN status) that I am using to predict the [likely] extent of occurrence for each.

What specifically are you looking at with respect to Lepilemur?

It's my master's thesis project, so any sort of advice or help you could give me would be wonderful!

Heather
