"PCR duplicates" & "Clone ID"/gl.filter.cloneid?

Gabriella Scatà

Apr 18, 2022, 1:57:01 AM

to dartR

Hi everyone,
I was wondering whether, when DArT pre-filters the SNP data before providing it to the end user, they also remove potential PCR duplicates (that is, "read duplicates": sequence reads that result from sequencing two or more copies of the exact same DNA fragment, which arise when more than one PCR copy of the original DNA fragment hybridizes to the flow cell).

As far as I understand, the "Clone ID" (which also matches the "Allele ID") identifies unique target sequences (sequence tags), so that multiple sequence tags with the same Clone ID would represent "PCR duplicates". Is that correct?

Does DArT already pre-filter and remove these PCR duplicates before providing the user with the final SNP dataset, or should we filter PCR duplicates ourselves?

And if we have to filter out "PCR duplicates" ourselves, is the "gl.filter.cloneid" function the correct one to use?

When I try to use "gl.filter.cloneid", no loci are filtered out. I assume that is because I had already filtered out secondaries with "gl.filter.secondaries".

In the dartR manual, it says:

"gl.filter.secondaries: SNP datasets generated by DArT include fragments with more than one SNP and record them separately with the same CloneID (=AlleleID). These multiple SNP loci within a fragment (secondaries) are likely to be linked, and so you may wish to remove secondaries"

Would the "gl.filter.secondaries" function also have removed potential PCR duplicates?

I would really appreciate your help on this matter.
Thank you.
Best,
Gabriella

Arthur Georges

Apr 18, 2022, 8:22:12 PM

to dartR

Hi Gabriella,

The occurrence and management of multiple SNPs scored from a single sequence tag is a different issue from PCR duplicates. As I understand it, PCR duplicates result from sequencing two or more copies of the same fragment of DNA. Setting aside the possibility of sequencing errors, PCR duplicates can make the occurrence of the affected allele appear proportionately more abundant than it should compared to the other allele. This could potentially have an effect on the calls, particularly for heterozygotes. Most workflows have ways of detecting and eliminating PCR duplicates.
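To make the effect on heterozygote calls concrete, here is a toy Python sketch. It assumes a hypothetical read-level representation in which each read carries an identifier for the original fragment it was copied from, plus the allele it shows. That representation, the function name, and the numbers are all invented for illustration; this says nothing about how DArT's pipeline actually stores or handles reads.

```python
from collections import Counter

def allele_counts(reads, dedupe=False):
    """Count reads per allele.

    reads  : list of (fragment_id, allele) tuples (hypothetical format)
    dedupe : if True, count each original fragment only once
    """
    if dedupe:
        # Keep the allele of the first read seen for each fragment.
        first_seen = {}
        for frag_id, allele in reads:
            first_seen.setdefault(frag_id, allele)
        return Counter(first_seen.values())
    return Counter(allele for _, allele in reads)

# Hypothetical heterozygote: 4 fragments carry A and 4 carry G,
# but fragment "f1" (allele A) was sequenced 6 times.
reads = (
    [("f1", "A")] * 6
    + [("f2", "A"), ("f3", "A"), ("f4", "A")]
    + [("f5", "G"), ("f6", "G"), ("f7", "G"), ("f8", "G")]
)

raw = allele_counts(reads)                  # duplicates included
dedup = allele_counts(reads, dedupe=True)   # one count per fragment

print("raw:  ", dict(raw))
print("dedup:", dict(dedup))
```

With duplicates included, allele A looks more than twice as abundant as G, even though the underlying fragments are evenly split. That kind of skew is what could push a genotype caller away from a heterozygote call.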

As for how DArT does this, that is a question for DArT, not dartR. dartR only deals with post-processing after DArT has woven its magic. dartR works from the DArT reports and does not delve down into the pipelines used by DArT to deliver reliable SNP and SilicoDArT data.

So you need to address this question to Diversity Arrays Technology.

Multiple SNPs scored from a single sequence tag -- we do manage those in dartR because they are included in the DArT reports. The relevant scripts are gl.report.secondaries and gl.filter.secondaries as you have indicated.
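As a rough illustration of what "secondaries" means, the Python sketch below groups toy SNP records by CloneID and keeps one SNP per sequence tag. The record layout, the function name, and the "keep the first one seen" rule are assumptions made for this example only; this is not the dartR implementation, and the dartR documentation should be consulted for how gl.filter.secondaries actually chooses which SNP to retain.

```python
def split_secondaries(loci):
    """Split loci into (kept, secondaries).

    loci : list of (clone_id, snp_position) tuples (hypothetical format)
    The first SNP seen for each CloneID is kept; any further SNPs on the
    same sequence tag are treated as secondaries.
    """
    kept, secondaries, seen = [], [], set()
    for clone_id, snp_pos in loci:
        if clone_id in seen:
            secondaries.append((clone_id, snp_pos))
        else:
            seen.add(clone_id)
            kept.append((clone_id, snp_pos))
    return kept, secondaries

# Toy data: tags 100001 and 100003 each carry two SNPs.
loci = [
    ("100001", 12),
    ("100001", 45),
    ("100002", 7),
    ("100003", 30),
    ("100003", 31),
]

kept, secondaries = split_secondaries(loci)
print("kept:       ", kept)
print("secondaries:", secondaries)
```

Note that the records sharing a CloneID here are different SNP positions on the same sequence tag, not repeated reads of one fragment, which is why secondaries and PCR duplicates are separate issues.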

All the best, and hope you had a happy Easter break.

Arthur

Gabriella Scatà

Apr 27, 2022, 12:40:36 PM

to dartR

Dear Arthur,

Thank you so much for your detailed reply, and sorry for my late response.

I have already contacted DArT for further information on the PCR duplicates, but have had no response so far.

Thanks a lot for all the info and clarifications!

Hope you had a great Easter and Anzac holiday break.

Best,

Gabriella
