Lost in parameter space

Diego Caraballo

Aug 14, 2023, 3:02:02 PM
to Stacks
Hi there. I am trying to perform the analyses recommended by Paris et al. 2017 (doi: 10.1111/2041-210X.12775) to decide which parameter values of the Stacks pipeline would be best for a de novo RAD-seq study.

I would like to ask if I am gathering the correct output variables for plotting the different run conditions. First, I am testing these ranges:
m: 3-7
M: 1-8
n: M, M+1, M-1
While testing one parameter, the others are kept fixed at m=3, M=2, n=0, N=4, and r80.

I extract coverage from the output denovo_map.log, in the section bounded by BEGIN cov_per_sample and END cov_per_sample. The variable is: depth of cov.
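Concretely, I am pulling that section out with a range print (the stacks.M2 directory name is just my own naming for that run):

# print everything between the BEGIN/END markers, markers included
sed -n '/^BEGIN cov_per_sample/,/^END cov_per_sample/p' stacks.M2/denovo_map.log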

The number of assembled loci is extracted from the gstacks.log.distribs file, in the section bounded by BEGIN effective_coverages_per_sample and END effective_coverages_per_sample. The variable is: n_loci.
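For this file I use the stacks-dist-extract helper that ships with recent Stacks 2 releases; the same sed range print as above works if the helper is not on your install:

# dump a named section from a *.log.distribs file
stacks-dist-extract stacks.M2/gstacks.log.distribs effective_coverages_per_sample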

Now I am getting a little bit confused when trying to find the number of polymorphic loci per sample, and the number of SNPs per sample.
Is the number of polymorphic loci the n_loci included in the section variant_sites_per_sample of the populations.log.distribs file?

I would appreciate it if you could help me confirm where this information is in the outputs.

Best,

Diego

Catchen, Julian

Aug 15, 2023, 11:50:33 AM
to stacks...@googlegroups.com

Hi Diego,

 

I would recommend that you look at our latest protocol paper, Rivera-Colon et al., which recapitulates the procedure in Paris et al. and gives more detail (and commands) for getting the r80 values from the Stacks output files, in section 2.5.1.

 

In short, you are proposing to test too many parameters; for example, newer work has shown that -m can be fixed at 3 and that -M and -n should normally be set to the same value (see Rivera-Colon for the details).
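Very roughly, the sweep then boils down to a single loop over M with n tied to it, along the lines of the sketch below; the paths, popmap name, and thread counts are placeholders, and the protocol paper has the exact commands for your situation.

for M in 1 2 3 4 5 6 7 8 9; do
    out=stacks.M$M
    mkdir -p "$out"
    # -m is left at its ustacks default of 3
    denovo_map.pl --samples ./cleaned --popmap ./popmap.opt -o "$out" -M "$M" -n "$M" --paired -T 8
    # apply the r80 filter: keep loci present in 80% of the samples in each population
    populations -P "$out" -M ./popmap.opt -r 0.80 -t 8
done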

 

Best,

 

julian

Diego Caraballo

Sep 12, 2023, 9:10:15 AM
to Stacks
Thanks for your reply, Julian. That protocol paper is a real gem!!

I followed the R80 test from M=1 to M=12, obtaining this result:
[Attached image: Rplot.jpeg]
The change in the number of R80 loci is very close to zero. My original dataset contains ca. 5-6 individuals per population, from 7 related grasshopper species (same genus), and an outgroup. The subset with which I ran the R80 test has 20 individuals from different species and populations, including the outgroup, with retained-read counts close to the mean of the original dataset.

Total & Retained reads.jpeg

Now I am running the M=13 analysis, but I wanted to ask whether you think I should still choose the highest M value before the change in the number of loci turns negative, even though the slope is almost zero for M values from M=10 onward.

Best,

Diego

Catchen, Julian

Sep 12, 2023, 3:33:03 PM
to stacks...@googlegroups.com

Hi Diego,

 

It is a judgement call when to stop searching the parameter space. Sometimes, particular data sets will never converge/go negative. Regardless, you know from your plots that at or over M=9 you start to see diminishing returns, so you are likely capturing nearly all of the loci at that setting. I would be comfortable setting M to any of those latter values.

 

Best,

 

julian

 

Dylan O'Hearn

Sep 13, 2023, 1:54:55 PM
to Stacks
Hi Diego,

I'm just chiming in as someone who has recently gone through the optimization procedure. I found that for popgen datasets, sticking to the M=n guideline from the latest paper worked fine: while different values (such as n=M+1 or n=M-1) did produce higher numbers of R80 loci, the difference was relatively small (maybe 1,000 extra loci when I already had 80,000) and did not affect the results of any follow-on analyses I did, such as fineRADstructure.

Conversely, when I optimized a phylogenetic dataset, I found that getting to the truly optimal set of values was important, because my dataset had some "bottlenecks": older museum samples that were the sole representatives of their taxa and that were subject to DNA degradation and allele dropout. For these particular samples, the difference between M=n and the actual optimal parameter settings was substantial and mattered for any follow-on analyses that included them.

So in conclusion, if you're doing popgen, or even if you're doing phylogenetics but all of your samples were in good condition, getting really down in the weeds with this optimization is probably not worth it.

Diego Caraballo

Oct 18, 2023, 10:18:54 AM
to Stacks
Thanks for your reply, Julian. The analysis indeed went negative from M=14, although there is visible saturation for M>9.
This is the summary of the M=13 run (populations.log):
Removed 327030 loci that did not pass sample/population constraints from 336868 loci.
Kept 9838 loci, composed of 4799056 sites; 92996 of those sites were filtered, 45039 variant sites remained.
Number of loci with PE contig: 9838.00 (100.0%);
  Mean length of loci: 477.81bp (stderr 0.32);
Number of loci with SE/PE overlap: 4298.00 (43.7%);
  Mean length of overlapping loci: 487.13bp (stderr 0.21); mean overlap: 25.60bp (stderr 0.16);
Mean genotyped sites per locus: 482.26bp (stderr 0.32).
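For what it is worth, I built the R80 curve by pulling the "Kept ... loci" line from each run's populations.log (the stacks.M$M directory names are just my own convention):

for M in $(seq 1 14); do
    kept=$(grep -m1 '^Kept' stacks.M$M/populations.log | awk '{print $2}')
    printf 'M=%s\t%s\n' "$M" "$kept"
done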

The coverage is acceptable.
But when I take a closer look at other metrics, such as the number of loci from the catalog present in each sample, variant sites per locus, or the number of assembled loci, I notice that the outgroup behaves as an outlier.



This is to be expected, since the outgroup is a more distantly related species (a different genus). The ingroup is still composed of samples from 25 localities, representing 7-10 species from a single genus. I then ran the same R80 test, but replaced the outgroup with another sample (with an average number of retained reads).

This run converged earlier, as expected:

And there is no outlier anymore:


Populations summary for M=9 (populations.log):
Removed 276452 loci that did not pass sample/population constraints from 290578 loci.
Kept 14126 loci, composed of 6881532 sites; 111064 of those sites were filtered, 56031 variant sites remained.
Number of loci with PE contig: 14126.00 (100.0%);
  Mean length of loci: 477.15bp (stderr 0.22);
Number of loci with SE/PE overlap: 6136.00 (43.4%);
  Mean length of overlapping loci: 487.48bp (stderr 0.18); mean overlap: 25.06bp (stderr 0.11);
Mean genotyped sites per locus: 481.54bp (stderr 0.22).

I made this library for both a population genomics analysis and a phylogeny. For the popgen analysis I will remove the outgroup, and according to your protocol the optimal assembly parameter should be M=9. This makes sense, since the read lengths are "long". As there are different species in the pool, it is to be expected that there will be several mismatches between samples at a given locus. So, what value would be appropriate for n? I guess it should be n>M, but I don't know whether I should run an additional R80 test fixing M=9 and varying n, or whether there is another way to set it confidently.

Till now I was using:
n=M
N=M+2
m=3

Thank you very much for your guidance!

Best,

Diego

betsy.s...@gmail.com

Nov 10, 2023, 9:44:58 AM
to Stacks
Hi Diego, 

I am working on this exact step right now and I was curious if you tried anything for setting n. 

Thanks
Betsy

Diego Caraballo

Nov 10, 2023, 10:00:31 AM
to Stacks
Hi Betsy! Yes, I tried an analogous test for n, as I did with M. I used a dataset excluding the outgroup (which is the dataset of interest for the population analysis) and obtained an optimum at M=9.
Then I fixed the value of M and performed several runs with n values equal to or above M. Since M can be considered the level of biological polymorphism within an individual, and n is equivalent to polymorphism across the population, it is to be expected that n>M.

I calculated the difference in R80 loci between each value of n and n-1.
So my optimum value for n is 11.
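In case it is useful, the deltas came from a small two-column table of n versus the number of R80 loci (r80_by_n.tsv is just the name I gave that file):

# column 1 = n, column 2 = R80 loci kept for that run
sort -n r80_by_n.tsv | awk 'NR > 1 { print $1, $2 - prev } { prev = $2 }'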

I hope this helps!

Cheers,

Diego

betsy.s...@gmail.com

Nov 13, 2023, 11:35:34 AM
to Stacks
Thanks so much, Diego! This helps a lot. 