Using training data as test data


Yulia F.

Jul 5, 2021, 4:13:36 AM
to Maxent
Dear all,

I do research on endemic plant species. I think that my colleagues and I know all the presence points of several species, since they have a narrow environmental niche and the area is well-researched.
We are interested in how habitat suitability in this specific region will change in the future. The number of presence points is 30-80. We've been using cross-validation, but now some of us are questioning whether it is the right choice.

So, is it necessary to use some points as a test sample in this case, or would it be better to use the training sample as a test sample? The latter option maximizes the number of presence points, but I am unsure about its accuracy, since it's rarely mentioned in publications.

I would appreciate any ideas and links!
Thanks in advance.
Yulia

Bede-Fazekas Ákos

Jul 5, 2021, 4:37:43 AM
to max...@googlegroups.com
Hello Yulia,
"I know all the presence points of several species": in this case why not using an SDM method more fitted to presence-absence data, eg. GBM? You have real absences points now! MaxEnt is developed to deal with presence-only datasets.
"would it be better to use the training sample as a test sample?": yes, if you want to get a more reliable model, and no, if you want to measure the reliablitity/transferability. The order is the following:
1) train = test
2) cross-validation
3) split of the dataset into a train and a test set (e.g. 50-50% random split with prevalence stratification, or spatial/environmental blocks)
4) use a gold-standard (independent) test dataset
Towards (4), the options give you a more and more accurate estimation of the reliability of your model. It is rarely the case that we have two independent datasets, one for training and one for testing. Within (1)-(3), towards (1), you get more and more presence data for training, which lets you train a more reliable model. There is no single best choice...
But if you choose no. (1), you will never be certain about the goodness of your model, and in a scientific publication it is necessary to measure the goodness somehow. (AUC/TSS/Boyce etc. calculated on the training set has no meaning at all.)
So cross-validation (2) is a kind of optimal solution between losing too many training points (3) and losing the ability to measure the reliability (1).
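To illustrate the difference between (1) and (2), a minimal sketch using the same hypothetical data as the sketch above, comparing cross-validated AUC with training-set AUC; a single stratified split, option (3), could use train_test_split(..., stratify=y) instead:

```python
# Minimal sketch (assumed, not from the thread): option (2), cross-validated AUC,
# versus option (1), AUC computed on the very data the model was trained on.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

data = pd.read_csv("sites_with_predictors.csv")          # hypothetical file
X, y = data[["bio1", "bio12", "elevation"]], data["presence"]

model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01, max_depth=3)

# Option (2): AUC estimated on held-out folds -- an honest estimate of reliability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("cross-validated AUC:", cv_auc.mean())

# Option (1): AUC on the training data itself -- typically optimistic, which is
# why it "has no meaning" as a measure of reliability.
model.fit(X, y)
print("training AUC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```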
HTH,
á

Yulia F.

Jul 6, 2021, 2:04:11 AM
to Maxent

Dear Ákos,

Wow, thank you for the detailed response!

I had some ideas about the decreasing reliability, but it is super helpful to see these approaches laid out as a system. I'll have to reevaluate some of my ideas now :)

We use MaxEnt because we do not have all the presence points for every species that we study (so it's for the sake of using one method for a lot of different rare endemic/relict species). However, I'm thinking about implementing other methods for more thorough research on some species, too.

I really appreciate your time.

Yulia F.

Monday, July 5, 2021 at 13:37:43 UTC+5, Ákos Bede-Fazekas: