High PCR duplication (~90%) across de novo assembly parameters


Austin Koontz

Dec 13, 2021, 5:21:06 PM12/13/21
to Stacks
Hello Stacks users and developers,

Background: I'm using Stacks (v2.59) to generate a de novo assembly of 198 paired-end samples (Quercus acerifolia). Following Paris et al. 2017 as well as the Stacks 2 preprint on bioRxiv, I'm trying to optimize this assembly by exploring a range of parameter values (m: 3--7; M and n: 1--8; gt-alpha: 0.05, 0.01) to maximize coverage and the number of total/polymorphic loci. I'm finding that, across all parameter combinations, rates of PCR duplication are very high (mean of 92.5% across 80 assemblies; I calculate this by running stacks-dist-extract gstacks.log.distribs effective_coverages_per_sample for each assembly and averaging). An example denovo_map.log for one of these assemblies is attached.

Question: does this indicate a genuine over-amplification of template DNA during library prep, or could this somehow be a result of faulty parameter specifications? Our samples have relatively high coverage (an average of 11.5 million retained reads per sample; see process_radtags.log), so I wonder whether these high PCR duplication rates are simply a product of (relatively) high coverage (although effective coverage values are low). Is there a way I could tell?

Additional context:
  1. These samples were generated using the NextRAD library prep procedure (description here), which produces fragments fixed at one end and of variable length prior to PCR amplification. So I believe specifying the --rm-pcr-duplicates flag is correct.
  2. This library prep workflow uses selective primers, rather than particular restriction enzymes, to amplify fragments. Hence the --disable-rad-check flag in the process_radtags command.
  3. I've also attached an .Rdata file containing a table that summarizes assembly metrics across parameters, in case that information is useful.
process_radtags.log
denovo_map.log
assemblyMetrics.Rdata
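(For anyone repeating the averaging step above: a minimal Python sketch of it is below. It assumes the section that stacks-dist-extract prints for effective_coverages_per_sample is a tab-separated table with a per-sample PCR-duplicate-rate column; the column name `pcr_dupl_rate` and the numbers in the toy table are illustrative, not taken from the actual gstacks.log.distribs format — check your own file's header line.)

```python
import csv
import io

def mean_pcr_dup_rate(dist_extract_output: str, column: str = "pcr_dupl_rate") -> float:
    """Average a per-sample PCR duplicate column from the TSV that
    stacks-dist-extract prints for effective_coverages_per_sample.
    Comment lines (starting with '#') are skipped. The column name
    is an assumption -- verify it against your gstacks.log.distribs."""
    lines = [ln for ln in dist_extract_output.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    rates = [float(row[column]) for row in reader]
    return sum(rates) / len(rates)

# Toy example with made-up numbers in the assumed layout:
tsv = (
    "sample\tn_loci\tmean_cov\tpcr_dupl_rate\n"
    "s1\t50000\t30.1\t0.93\n"
    "s2\t48000\t28.7\t0.91\n"
)
print(round(mean_pcr_dup_rate(tsv), 3))  # -> 0.92
```

Running this once per assembly and averaging the results reproduces the per-assembly mean described above.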

Catchen, Julian

Dec 13, 2021, 5:37:28 PM12/13/21
to stacks...@googlegroups.com

Sorry, but PCR duplicates are a product of library construction; they are not created or modified by post-sequencing data analysis (we have a paper coming out on this, hopefully soon). If you start with too little DNA template and then amplify it, you will produce many copies of your very few template molecules. Higher coverage will result in more PCR duplicates, but the rate is a function of which templates were in your library to start with, and of which subset of those templates/copies was randomly selected onto the flow cell during sequencing. However, this is not a matter of over-amplification per se; it is a matter of too little template DNA at the start. If you had enough template to begin with, the amplification would reflect that, not the other way around.
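(A back-of-the-envelope illustration of this point, under a deliberately simplified model — not the model from any Stacks paper: if a library contains T distinct template molecules and R reads are drawn from it uniformly with replacement, the expected number of distinct templates observed is T(1 - (1 - 1/T)^R), and everything beyond the first read per template is a duplicate. The 11.5M read depth matches the post; both template counts are invented for illustration.)

```python
def expected_dup_rate(templates: int, reads: int) -> float:
    """Expected PCR-duplicate rate under a toy model: 'reads' draws,
    uniformly with replacement, from 'templates' distinct molecules.
    Every read beyond the first hit on a template is a duplicate."""
    unique = templates * (1.0 - (1.0 - 1.0 / templates) ** reads)
    return 1.0 - unique / reads

# Few template molecules, deep sequencing -> mostly duplicates:
print(round(expected_dup_rate(1_000_000, 11_500_000), 2))    # -> 0.91
# Ample template at the same depth -> few duplicates:
print(round(expected_dup_rate(100_000_000, 11_500_000), 2))  # -> 0.06
```

The first case reproduces a ~90% duplicate rate at exactly the read depth reported above, which is why the rate points at template quantity rather than at assembly parameters.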
