estimating the number of rad loci

coyk

unread,

Jul 22, 2022, 6:56:08 PM7/22/22

to Stacks

Hi all,

I would like to compare my estimates of how many rad loci I thought I would get from an experiment vs how many I actually got. When I estimate, I do a rough back of the envelope calculation. Hopefully, there's a genome in Genbank. If so, I take the size reported (e.g. 2.8E9) and the %GC reported (e.g. 0.41) and calculate the probability of my enzyme (e.g. SbfI) cutting across the genome (e.g. 6.46E-6) and multiply that by the genome size. In this case, I would expect 18k rad loci. This should be an overestimate, I think, because we randomly shear after the digest, then size select, unless the SbfI cut site occurs more frequently.

Which Stacks output do I compare this to? I don't use the wrapper scripts to run Stacks, but I save everything that prints to the console to a file. I don't see anything about loci in the ustacks output. In cstacks, there's "Final catalog contains x loci" and in gstacks there's "Genotyped x loci". Are either of these values comparable to the estimate above?

Thanks,

Katharine

Angel Rivera-Colón

unread,

Jul 25, 2022, 2:40:35 PM7/25/22

to Stacks

Hi Katherine,

First, regarding the estimations. As you mentioned, this is a back of the envelope calculation, and might not always be the most accurate. Note that the 18K is the number of SbfI sites in the genome, but if using a single digest protocol each restriction cutsite results in two independent RAD loci. In turn, you would expect 36K loci in this scenario. One advice, if there is a reference sequence for a related species you can use our software RADinitio to do an in in silico estimation of the number of cutsites. The software accounts for the actual distribution of restriction enzyme cutsites in the sequence, the effects of loci proximity, and size selection. You can read more about the software on our manuscript (Rivera-Colon et al. 2021). Figure 5 shows some comparisons between the in silico estimation and the genome size+GC methods you describe. In addition, we also discuss this loci estimation further in section 2.3 of our Stacks 2 protocol (Rivera-Colon & Catchen 2022).

For how these estimations compare to the Stacks output, that will depend on a few things. Each ustacks run is for a single individual, and the results do contain "junk" loci generated due to sequencing error in each sample. The catalog in gstacks has the data for all samples, but the raw number of loci in the catalog will most likely be inflated by the combination of junk loci across all individuals. The more samples in the catalog the larger this inflation will be. In my experience, a better estimate is obtained by running populations with some moderate filters (these depend on your data, but something like r50 may be sufficient). These filters remove the majority of the junk loci and retain the core of real loci present across all individuals, and should be comparable, at least in order of magnitude, to the initial loci number estimations. Note that the number will likely never match because 1) some loci will never be properly assembled, 2) some of the real loci might be filtered, and 3) the empirical data will always have some noise that will affect the final results (i.e., having low coverage data will make it harder for loci to be seen across all individuals). We also discuss some of these effects on section 2.8.1 of the protocol linked above.

In summary, to compare against an estimated number of RAD loci use the number of loci retained by populations after applying some moderate filters. The numbers will never match 100%, but this is expected.

Hope this information is useful.

Thanks,

Angel

coyk

unread,

Jul 26, 2022, 4:55:42 PM7/26/22

to Stacks

This is incredibly helpful. Thank you so much!

K

Reply all

Reply to author

Forward