Hi Katherine,
First, regarding the estimations. As you mentioned, this is a back of the envelope calculation, and might not always be the most accurate. Note that the 18K is the number of SbfI sites in the genome, but if using a single digest protocol each restriction cutsite results in two independent RAD loci. In turn, you would expect 36K loci in this scenario. One advice, if there is a reference sequence for a related species you can use our software RADinitio to do an in
in silico estimation of the number of cutsites. The software accounts for the actual distribution of restriction enzyme cutsites in the sequence, the effects of loci proximity, and size selection. You can read more about the software on our manuscript (
Rivera-Colon et al. 2021). Figure 5 shows some comparisons between the
in silico estimation and the genome size+GC methods you describe. In addition, we also discuss this loci estimation further in section 2.3 of our Stacks 2 protocol (
Rivera-Colon & Catchen 2022).
For how these estimations compare to the Stacks output, that will depend on a few things. Each ustacks run is for a single individual, and the results do contain "junk" loci generated due to sequencing error in each sample. The catalog in gstacks has the data for all samples, but the raw number of loci in the catalog will most likely be inflated by the combination of junk loci across all individuals. The more samples in the catalog the larger this inflation will be. In my experience, a better estimate is obtained by running populations with some moderate filters (these depend on your data, but something like r50 may be sufficient). These filters remove the majority of the junk loci and retain the core of real loci present across all individuals, and should be comparable, at least in order of magnitude, to the initial loci number estimations. Note that the number will likely never match because 1) some loci will never be properly assembled, 2) some of the real loci might be filtered, and 3) the empirical data will always have some noise that will affect the final results (i.e., having low coverage data will make it harder for loci to be seen across all individuals). We also discuss some of these effects on section 2.8.1 of the protocol linked above.
In summary, to compare against an estimated number of RAD loci use the number of loci retained by populations after applying some moderate filters. The numbers will never match 100%, but this is expected.
Hope this information is useful.
Thanks,
Angel