PCR duplicates

Silvia Bettencourt

Aug 27, 2025, 11:24:21 AM

to Stacks

Good morning all,

Background: I am just starting to learn how to analyse RAD-seq data. I have 24 samples from a tetraploid species, single-digest RAD, and QC looks OK. M/n was optimized at 12, with m=3 and max_locus_stacks=5. I am using galaxy.eu for the analysis because it offers Stacks 2 and I'm a beginner.

Data: 11M reads per sample on average, ranging from 0.7M to 42M (only one sample is under 1M). ustacks mean coverage is 12x; gstacks mean coverage is 40x, and ns_coverage per sample ranges from 9x to 124x.

Problem: I don't think my data make sense coverage-wise. In addition, the gstacks log file does not tell me how many PCR duplicates were removed. The gstacks output is below.

Attempted to assemble and align paired-end reads for 806501 loci:
  0 loci had no or almost no paired-end reads (0.0%);
  4464 loci had paired-end reads that couldn't be assembled into a contig (0.6%);
  For the remaining 802037 loci (99.4%), a paired-end contig was assembled;
    Average contig size was 235.2 bp;
  13332 paired-end contigs overlapped the forward region (1.7%)
    Mean overlap: 12.3bp; mean size of overlapped loci after merging: 212.2;
  Out of 75904698 paired-end reads in these loci (mean 90.0 reads per locus),
    72205159 were successfuly aligned (95.1%);
  Mean insert length was 295.5, stdev: 116.8 (based on aligned reads in overlapped loci).

Genotyped 802018 loci:
  effective per-sample coverage: mean=40.2x, stdev=36.4x, min=9.0x, max=124.7x
  mean number of sites per locus: 225.2
  a consistent phasing was found for 100705 of out 127196 (79.2%) diploid loci needing phasing

gstacks is done.

Any suggestions on the coverage?

How can I find out how many PCR duplicates were removed? The gstacks tool I'm using on galaxy.eu does not have an option to remove PCR duplicates.

I appreciate your help and input.

Silvia Bettencourt

Catchen, Julian

Aug 27, 2025, 12:14:13 PM

to stacks...@googlegroups.com

Hi Silvia,

If the --rm-pcr-duplicates flag was not specified to the gstacks program (or to denovo_map.pl), then no PCR duplicates would be removed. We are not connected to the Galaxy project, so I don’t know how they set this up.
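For anyone running Stacks directly rather than through Galaxy, here is a minimal sketch of how the flag could be added to a de novo gstacks call. The directory and popmap paths are placeholders, and the helper `gstacks_command` is purely illustrative; it only assembles and prints the command line, it does not run anything.

```python
import shlex


def gstacks_command(stacks_dir, popmap, threads=4, rm_pcr_duplicates=True):
    """Assemble a gstacks command line for a de novo, paired-end run.

    stacks_dir and popmap are placeholder paths supplied by the caller.
    """
    cmd = ["gstacks", "-P", stacks_dir, "-M", popmap, "-t", str(threads)]
    if rm_pcr_duplicates:
        # Without this flag, gstacks does not remove PCR duplicates.
        cmd.append("--rm-pcr-duplicates")
    return cmd


# Hypothetical paths, for illustration only.
cmd = gstacks_command("./stacks_out", "./popmap.tsv", threads=8)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list passed to `subprocess.run`, once the paths point at a real ustacks/cstacks/sstacks/tsv2bam output directory.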

If the flag had been specified, you would have seen something like this in the gstacks.log file:

Built 114736 loci comprising 100356561 forward reads and 92065401 matching paired-end reads; mean insert length was 340.0 (sd: 98.0).

Removed 8291160 unpaired (forward) reads (8.3%); kept 92065401 read pairs in 110571 loci.

Removed 51232513 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (55.6%); kept 40832888 read pairs.
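To check a log for these lines without reading it by eye, something like the following could work. It is only a sketch: the regular expression is written against the example wording quoted above, so it assumes the log phrasing matches that example, and the function name `pcr_duplicate_stats` is made up for illustration.

```python
import re

# Pattern written against the example log line quoted in this thread;
# it assumes the wording in gstacks.log matches that example.
DUP_RE = re.compile(
    r"Removed (\d+) read pairs whose insert length had already been seen "
    r"in the same sample as putative PCR duplicates \(([\d.]+)%\); "
    r"kept (\d+) read pairs"
)


def pcr_duplicate_stats(log_text):
    """Return (removed_pairs, percent_removed, kept_pairs).

    Returns None when the log has no PCR-duplicate line, which suggests
    the flag was not set for that run.
    """
    match = DUP_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), int(match.group(3))


example_log = """\
Built 114736 loci comprising 100356561 forward reads and 92065401 matching paired-end reads; mean insert length was 340.0 (sd: 98.0).
Removed 8291160 unpaired (forward) reads (8.3%); kept 92065401 read pairs in 110571 loci.
Removed 51232513 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (55.6%); kept 40832888 read pairs.
"""

stats = pcr_duplicate_stats(example_log)
print(stats)
```

A `None` result on a real log is the situation described in the original question: no duplicate-removal line, so no duplicates were removed.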

You might want to check with the folks running Galaxy EU or run the program directly yourself.

You appear to have high variability in coverage, so you may want to remove any samples <5x or so from the analysis (though it seems you are above that threshold). You could also consider downsampling some of the individuals that have very high coverage.
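As a rough illustration of that triage, the sketch below flags samples under a minimum coverage and estimates what fraction of reads to keep for the very deep ones. The sample names and coverage values are hypothetical, the 40x target is an arbitrary choice rather than a recommendation from this thread, and the calculation assumes coverage scales roughly linearly with the number of reads.

```python
def triage_by_coverage(coverage, min_cov=5.0, target_cov=40.0):
    """Split samples into ones to drop and ones to downsample.

    coverage: dict mapping sample name -> mean coverage (x).
    Returns (drop, keep_fraction), where keep_fraction maps each
    over-target sample to the approximate fraction of reads to keep.
    Assumes coverage is roughly proportional to read count.
    """
    drop = sorted(s for s, c in coverage.items() if c < min_cov)
    keep_fraction = {
        s: round(target_cov / c, 3)
        for s, c in coverage.items()
        if c > target_cov
    }
    return drop, keep_fraction


# Hypothetical per-sample coverages, for illustration only.
coverage = {
    "sample_A": 9.0,
    "sample_B": 35.0,
    "sample_C": 124.7,
    "sample_D": 3.1,
}

drop, keep_fraction = triage_by_coverage(coverage)
print("drop:", drop)
print("downsample:", keep_fraction)
```

The resulting fractions could then be handed to whatever read-subsampling tool is preferred, taking care to subsample forward and paired-end files together so that pairs stay matched.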

Regardless, I would proceed to the populations stage to see how many SNPs you have that are shared across your samples with a basic set of filters. Assessing your ‘missing data’ may be more useful than trying to assess coverage (beyond removing samples that obviously failed in the molecular library construction).
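One way to look at missing data per sample is to have populations write a VCF (its --vcf option) and count missing genotype calls. The sketch below does that for an uncompressed VCF; the function name and the tiny in-memory example are made up for illustration, and it assumes the genotype (GT) is the first colon-separated field of each sample column, as the VCF format requires when GT is present.

```python
def missing_rate_per_sample(vcf_lines):
    """Return {sample: fraction of sites with a missing genotype}.

    vcf_lines: iterable of lines from an uncompressed VCF.
    A call is counted as missing if its GT field contains '.'.
    """
    samples, missing, total = [], [], 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the 9 fixed columns
            missing = [0] * len(samples)
            continue
        total += 1
        for i, call in enumerate(fields[9:]):
            genotype = call.split(":")[0]
            if "." in genotype:
                missing[i] += 1
    if total == 0:
        return {}
    return {s: m / total for s, m in zip(samples, missing)}


# Tiny hypothetical VCF with two samples and four sites.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "s1", "s2"]
sites = [
    ["1", "10", ".", "A", "G", ".", "PASS", ".", "GT", "0/1", "./."],
    ["1", "20", ".", "C", "T", ".", "PASS", ".", "GT", "./.", "./."],
    ["2", "15", ".", "G", "A", ".", "PASS", ".", "GT", "1/1", "0/1"],
    ["2", "30", ".", "T", "C", ".", "PASS", ".", "GT", "0/0", "0/0"],
]
example_vcf = ["##fileformat=VCFv4.2", "\t".join(header)] + ["\t".join(s) for s in sites]

rates = missing_rate_per_sample(example_vcf)
print(rates)
```

For a real file, pass an open file handle instead of the example list. Samples with a much higher missing rate than the rest are the ones worth a closer look.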

Best,

Julian

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/

Silvia Bettencourt

Aug 28, 2025, 8:48:01 AM

to Stacks

Hello sir,

Good afternoon,

Thank you for your valuable input.

Since I'm learning from scratch, the Galaxy project seemed more suitable for the stage I am at. Once I feel more comfortable with all the inputs, concepts, and outputs, I'll start using the original Stacks on the command line.

Yes sir, I'll move forward to populations, since PCR duplicate removal is not available on galaxy.eu, though I'll contact the project to ask about it.

After the data analysis, I'll come back and downsample the highest-coverage samples as suggested and see what happens.

Once again, thank you so much for all your help and time.

Cheers,

Silvia Bettencourt
