parameter optimization and sample subset

Silvia Bettencourt

May 30, 2025, 6:00:23 AM

to Stacks

good morning all,

For the Stacks parameter optimization, is it wrong to optimize with all the samples in the study?

My RAD-seq species has only 24 samples sequenced. I thought that sub-sampling for the optimization could bias my end results, because too few samples would be used in the optimization. I know the guidelines advise sub-sampling from the total number of samples.

please advise.

thank you for your time

Silvia

Angel Rivera-Colón

May 30, 2025, 1:28:55 PM

to Stacks

Greetings Silvia,

I wouldn't say it is wrong per se. It is simply a tradeoff. The idea is to use a smaller subset of samples so that you can run the optimization process faster and explore parameters more efficiently. Aside from time differences, the optimization results themselves shouldn't be much different between running with 10 vs 24 individuals. Presumably, if you select a subset of samples that are representative of the whole population (or populations), the optimization process should be applicable to the whole pool of individuals sequenced. This can change somewhat depending on the underlying structure of your samples (e.g., if you have data from very divergent populations or different species you might want to optimize them a bit separately).
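As a rough illustration of picking a representative subset, the sketch below draws the same number of samples from each population listed in a population map. It assumes a tab-separated popmap with the sample name in the first column and the population in the second; the function name `subsample_popmap`, the `per_pop` argument, and the file names in the usage note are hypothetical and not part of Stacks.

```python
import random
from collections import defaultdict


def subsample_popmap(popmap_lines, per_pop, seed=0):
    """Pick up to `per_pop` samples from each population.

    `popmap_lines` is an iterable of lines of the form
    "sample<TAB>population" (assumed layout of the popmap).
    Returns a list of (sample, population) tuples.
    """
    by_pop = defaultdict(list)
    for line in popmap_lines:
        line = line.strip()
        # Skip blank lines and lines starting with '#' (assumed to be comments).
        if not line or line.startswith("#"):
            continue
        sample, pop = line.split("\t")[:2]
        by_pop[pop].append(sample)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for pop in sorted(by_pop):
        samples = sorted(by_pop[pop])
        k = min(per_pop, len(samples))  # never ask for more than exist
        for sample in sorted(rng.sample(samples, k)):
            subset.append((sample, pop))
    return subset
```

One possible use is to read the full popmap with `open("popmap.tsv")`, pass the file object to `subsample_popmap` with something like `per_pop=4`, and write the returned pairs back out as tab-separated lines to a smaller popmap for the optimization runs.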

In summary, using the subset is mainly to aid with running things faster, allowing for a more efficient exploration of the parameter space.

Aside from this, I always advise keeping a few key details about parameter optimization in mind. While important, the optimization process itself is not the end goal of the project. It is easy to fall into optimization rabbit holes with diminishing returns. Also, the optimization process is not perfect. Some datasets don't biologically conform to the logic behind optimization (e.g., when having very divergent populations). Others might converge to multiple optimal parameters. Ultimately, the goal is to use the optimization as a tool for downstream biological insights, which is the central goal behind most projects.

Hope this helps.

Angel

Silvia Bettencourt

May 31, 2025, 4:21:47 PM

to Stacks

hello sir,

Thank you for your reply, very clear and helpful. I think I'll keep going with the optimization using the 24 samples. It is an endemic, tetraploid species with a discrete distribution, and the 24 samples belong to 3 populations. At higher M the runs take time, but I'm about to run my last protocol at M=12. After that, if the kept loci don't drop below zero, I'll stop and decide how to proceed. My question about the subset was directly related to the fact that the kept loci are not dropping below zero at high M, but if it is not wrong to use the entire sample size, then I'll run this last one and stop.
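
One reading of the stopping rule described above is to track the change in the number of kept loci between consecutive M values and to stop once that change is zero or negative. The sketch below follows that reading only. The function names and the input dictionary, which maps each M to a locus count collected by hand from the runs, are hypothetical and do not come from Stacks.

```python
def loci_changes(loci_by_m):
    """Return (M, change in kept loci versus the previous M tested)."""
    ms = sorted(loci_by_m)
    return [(m, loci_by_m[m] - loci_by_m[prev]) for prev, m in zip(ms, ms[1:])]


def first_non_gain(loci_by_m):
    """Return the first M whose change in kept loci is zero or negative.

    Returns None if the count keeps increasing over every M tested.
    """
    for m, delta in loci_changes(loci_by_m):
        if delta <= 0:
            return m
    return None
```

If `first_non_gain` returns `None` up to the largest M tested, the count never stopped increasing over that range, which matches the situation described in this post.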