High per sample coverage, low effective coverage per sample

Austin Koontz

Jan 19, 2022, 4:51:24 PM
to Stacks
Background: I recently used the denovo_map.pl command in Stacks (v2.59) to build a de novo assembly of 198 paired-end samples (Quercus acerifolia). The parameters for that assembly are shown in the command below, and were determined based on parameter optimization runs I performed using a subset of my samples. 

denovo_map.pl -T 28 -m 7 -M 4 -n 4 --gt-alpha 0.01 -N 0 --samples /Stacks/process_RADtags/QUAC --popmap ./QUAC_popmap -o ./output/ --paired --rm-pcr-duplicates -X "populations: -R 80"

Issue: While the per-sample coverage of this dataset appears to be very high (mean of 70.54x; ~11 million reads per sample), the mean effective coverage (calculated using stacks-dist-extract on gstacks.log.distribs) is very low (4.52x). I've attached the relevant log files.

What does this say about these data? That is, how can effective coverage be so low when per-sample coverage is so high, especially since these are (in theory) closely related individuals? I understand that effective coverage values are weighted based on the number of samples present at a locus, but the missing data frequencies per individual after filtering in the populations module don't seem high enough to justify this vast difference in coverage values. Could it be that there are too many samples in this assembly, so the catalog contains many loci unique to single individuals?

Thanks for any insight into this,
Austin
denovo_map.log
populations.log.distribs
gstacks.log.distribs

Catchen, Julian

Jan 24, 2022, 6:53:44 AM
to stacks...@googlegroups.com

Hi Austin,

 

The answer to your question can be found in the denovo_map.log file:

 

Removed 551559315 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (93.2%); kept 39936998 read pairs.

 

As well as in the effective_coverages_per_sample table of gstacks.log.distribs.
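
For reference, that table can be pulled directly from the log with stacks-dist-extract (the section name below is as printed by recent Stacks versions; check your own log if it differs):

stacks-dist-extract gstacks.log.distribs effective_coverages_per_sample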

 

Your data consist of >90% PCR duplicates. This means that you had very few original DNA templates in the DNA extractions done during the molecular protocol. So, when you amplified the DNA using PCR, you generated lots and lots of copies of those very few templates. Your sequencing appears to have gone well, so you sequenced all of the PCR duplicate copies to high fidelity, giving you very high coverage of a very small number of original pieces of DNA from your organism. Therefore, even though you have many sequenced reads, the information content of those reads (the independent sequence of nucleotides of each allele) is quite low.
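
As a rough back-of-the-envelope check using the numbers from your logs (approximate, since effective coverage is also weighted by how many samples are present at each locus):

  70.54x raw coverage × (1 − 0.932) ≈ 4.8x

which is in line with the 4.52x effective coverage gstacks reports.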

 

julian

Diego Caraballo

Nov 14, 2023, 8:36:27 AM
to Stacks
I am in the same situation. I performed the optimization for the M and n values, and although I have high raw per-sample coverage (20-30x), my effective per-sample coverage is extremely low (mean=2.2x, stdev=0.3x, min=1.4x, max=2.7x) because of a large fraction of PCR duplicates (91.4%). Can these data be saved in any way, or do I inevitably have to redo the libraries? Does it make sense not to remove the PCR duplicates, so that I at least have greater coverage, even if only for a few loci?

Best,

Diego

Angel Rivera-Colón

Nov 15, 2023, 2:57:27 PM
to Stacks
Hi Diego,

I would prefer to remove PCR duplicates whenever possible. Leaving the duplicate reads in, particularly at very high duplicate rates, creates very large biases in the distribution of observed alleles, which can negatively impact genotyping (see Fig. 5D in Rochette et al. 2023).

A better approach might be to remove the PCR duplicates and increase the stringency of the genotype calls to better account for the lower coverage (see section 2.9.2 of the Stacks 2 protocol paper; Rivera-Colón & Catchen 2022), for example by rerunning gstacks as sketched below. This is always a tradeoff, of course: you get fewer genotypes called, but the genotypes you do get should be more accurate. You can then assess how much data is retained after filtering this higher-stringency catalog.
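
Something along these lines (the directory paths are placeholders for your own; --gt-alpha 0.01 is the stricter genotype-call threshold being discussed here):

gstacks -P ./stacks_output -M ./popmap --rm-pcr-duplicates --gt-alpha 0.01 -t 8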

Best,
Angel

Diego Caraballo

Jul 2, 2024, 11:33:41 AM
to Stacks
Hi Angel,

Thanks for your response. I reran gstacks following your recommendations; basically, I set gt-alpha to 0.01. However, coverage did not improve (see below).

gstacks default
Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)

Removed 151043299 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (91.4%); kept 14159588 read pairs.

Genotyped 635963 loci:

  effective per-sample coverage: mean=2.2x, stdev=0.3x, min=1.4x, max=2.7x


gstacks gt-alpha=0.01
Model: marukilow (var_alpha: 0.01, gt_alpha: 0.01)

Removed 151043299 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (91.4%); kept 14159588 read pairs.

Genotyped 635963 loci:

  effective per-sample coverage: mean=2.2x, stdev=0.3x, min=1.4x, max=2.7x


As you can see, the results are exactly the same. Would it be useful to try lower values for gt-alpha? If not, is there any chance of recovering information for population genetic or phylogenetic analyses from these data?

Best,

Diego

Angel Rivera-Colón

Jul 8, 2024, 10:54:17 AM
to Stacks
Hi Diego,

The var-alpha and gt-alpha parameters affect the genotyping (how sensitive the calls are to coverage), but not the assembly of the loci themselves, so it is expected that you see the same number of loci and the same length and coverage distributions after changing them. Do you see any differences in the genotypes? For example, how many variant sites and loci do you obtain after running populations, and does the missing data per individual change (see the commands sketched below)? That will give you more information about how usable these data are for downstream analyses.
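
For example, something like the following (the paths and the -R filter are placeholders; the distribs section names are the ones documented in the Stacks manual, so double-check them against your version's populations.log.distribs):

populations -P ./stacks_output -M ./popmap -R 0.8 -t 8
stacks-dist-extract populations.log.distribs loci_per_sample
stacks-dist-extract populations.log.distribs variant_sites_per_sample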

Thanks,
Angel