PCR duplicates-ddRAD single end.

Ali Basuony

May 24, 2021, 7:20:37 AM

to Stacks

Hello everyone,

I'm running the ref_map pipeline on single-end ddRAD data, and my question is how we can remove putative PCR duplicates. I know there is an option (--rm-pcr-duplicates) that we used with the denovo pipeline, but I think it is not applicable to ref_map. I also think --rm-pcr-duplicates works for single-digest RADseq only. Is that right?

The data were generated in another lab 4 years ago, and we don't know whether they used degenerate barcodes to get rid of PCR duplicates.

I found the following reply by Julian Catchen to a similar question:

"You cannot use the PCR duplicates filter on double digest data. The two
enzyme cut sites spoil the algorithm for detecting duplicates (identical
alignment of pairs of reads). The only option for handling PCR
duplicates in ddRAD is to use a random oligo in your P1 adaptor, like as
is done in the 3RAD protocol"

But I'm wondering what the solution is if degenerate barcodes were not used.

In general, how can I be confident that there are no PCR duplicates in my data?

I'm using Stacks (v2.4).

Thanks in advance for the help.

Ali

Julian Catchen

May 24, 2021, 9:35:20 AM

to stacks...@googlegroups.com, Ali Basuony

Hi Ali,

Ask yourself: how can the software know that a pair of reads is a PCR
duplicate? How would you know if you were looking at the data? Does it
make sense that the software would have access to some special source of
information that you do not have access to?

You have quoted me giving the answer to this question.
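
To make the quoted point concrete, here is a minimal Python sketch (not Stacks code) of why identical alignment coordinates cannot identify duplicates in ddRAD. The read tuples, the coordinates, the oligo sequences, and the `count_apparent_duplicates` helper are all hypothetical and exist only for illustration.

```python
from collections import Counter

def count_apparent_duplicates(reads, use_oligo=False):
    """Count reads that share a key with an earlier read.

    Each read is a hypothetical tuple: (chrom, start, end, oligo).
    Without a random oligo, the key is the alignment coordinates only.
    """
    keys = Counter(
        (chrom, start, end, oligo) if use_oligo else (chrom, start, end)
        for chrom, start, end, oligo in reads
    )
    return sum(n - 1 for n in keys.values())

# Single-digest, paired-end RAD: random shearing gives each original
# molecule its own end coordinate, so a repeated (start, end) pair is
# good evidence of a PCR copy.
sheared = [
    ("chr1", 1000, 1310, None),
    ("chr1", 1000, 1310, None),  # same insert length: likely a PCR copy
    ("chr1", 1000, 1422, None),
    ("chr1", 1000, 1275, None),
]

# ddRAD: both ends are restriction cut sites, so every read from a locus
# has the same coordinates whether or not it is a PCR copy.
ddrad = [
    ("chr1", 1000, 1300, "ACGT"),
    ("chr1", 1000, 1300, "ACGT"),  # same oligo: likely a PCR copy
    ("chr1", 1000, 1300, "TTAG"),
    ("chr1", 1000, 1300, "GCCA"),
]

print(count_apparent_duplicates(sheared))                # 1
print(count_apparent_duplicates(ddrad))                  # 3
print(count_apparent_duplicates(ddrad, use_oligo=True))  # 1
```

In the ddRAD case, coordinates alone would flag three of four reads, including reads from distinct original molecules. Only the random oligo separates true copies from independent molecules. With single-end data the situation is even more limited, since only one end of each fragment is observed.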

Best,

julian

Ali Basuony

May 24, 2021, 10:02:27 AM

to Stacks

Hi Julian,

Thanks for clarifying that.

Does this mean that any ddRAD data generated without degenerate barcodes are not valid? Or, put differently, that Stacks is not a good choice for processing such data?
Many published ddRAD datasets were generated without degenerate barcodes. The authors note that PCR duplicates are just one of the limitations of ddRAD.

I'm planning to send some samples for ddRAD to a sequencing facility that doesn't use degenerate barcodes at all. They said that, to mitigate this problem, I should target high coverage. Is that right?

Thanks so much for your help.

Ali

Julian Catchen

May 24, 2021, 12:36:34 PM

to stacks...@googlegroups.com, Ali Basuony

Hi Ali,

Answers below.

Ali Basuony wrote on 5/24/21 9:02 AM:

> Does this mean that any ddRAD data generated without degenerate
> barcodes are not valid? Or, put differently, that Stacks is not a good
> choice for processing such data?

What does it mean for data to be "not valid", and why would that be
another way of saying that Stacks is not a good choice?

As I mentioned in my previous message, this implies that there is some
secret information that can be obtained about PCR duplicates in ddRAD
data that Stacks is not using, but some other software has access to.
Again, what would that information be? I ask you to think about how one
can identify a PCR duplicate in a sequenced library.

> Many published ddRAD datasets were generated without degenerate
> barcodes. The authors note that PCR duplicates are just one of the
> limitations of ddRAD.

Yes, that is correct. Your ddRAD data may contain a few or a lot of PCR
duplicates; there is no way around that, regardless of processing
software, full stop. That does not mean your data are not usable. As you
note, many ddRAD datasets have been successfully published, many
using Stacks. But PCR duplicates, which were not well understood when
ddRAD was first published, dilute the information your sequencing
library is providing.

> I'm planning to send some samples for ddRAD to a sequencing facility
> that doesn't use degenerate barcodes at all. They said that, to mitigate
> this problem, I should target high coverage. Is that right?

If you plan to make ddRAD libraries there are two primary things you
should focus on: 1) Start with high quality DNA in large quantities. If
your DNA is in very small amounts or degraded, it will result in a low
quality library, that is, a library with very few unamplified molecules
of DNA, which are the information you are trying to get by sequencing.
2) Reduce the number of PCR cycles you perform on your
libraries. The more PCR amplification you do, the more PCR duplicates
you will generate.

This is how #2 is related to #1: in a low quality library, you have very
little DNA, so most people crank up the PCR, which gives the illusion of
lots of DNA. Well, you do get lots of DNA, but it is almost
exclusively clones/copies of the very few original molecules you started
out with.
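
As a rough illustration of that relationship, the Python sketch below assumes idealized PCR in which every molecule doubles on each cycle (real efficiency is lower) and uses made-up molecule counts. The `cycles_needed` function is hypothetical and only does the arithmetic.

```python
import math

def cycles_needed(start_molecules, target_molecules):
    """Cycles of idealized PCR (perfect doubling) needed to reach the target."""
    return max(0, math.ceil(math.log2(target_molecules / start_molecules)))

target = 10**10  # made-up number of molecules wanted for sequencing
for start in (10**8, 10**5):  # high-input vs. low-input library (made up)
    n = cycles_needed(start, target)
    copies = 2**n  # copies of each original molecule after n doublings
    print(f"{start:.0e} original molecules: {n} cycles, "
          f"about {copies} copies of each original")
```

Under these assumptions the low-input library needs ten more cycles to reach the same total, so each original molecule is represented by roughly a thousand times more copies.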

Having good sequencing depth is very important for RAD data. However, if
your ddRAD library is full of PCR duplicates, increasing the sequencing
depth will cause you to sequence many more of the copies of your PCR
duplicates, without providing new information. To be specific,
increasing sequencing depth will provide you with more non-duplicate
reads, but you will gain those at a slower rate than you accumulate
PCR duplicate reads.
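
One way to see this is a simple sampling model, offered as a sketch under strong assumptions: a locus is represented by `m` original molecules, all amplified equally, and reads are drawn at random with replacement. The expected number of distinct original molecules among `n` reads is then m(1 - (1 - 1/m)^n). The function name and the numbers are made up for illustration.

```python
def expected_unique(m, n):
    """Expected distinct original molecules among n reads drawn uniformly
    at random (with replacement) from copies of m original molecules."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)

m = 20  # made-up number of original molecules at one locus
for n in (10, 20, 40, 80):
    unique = expected_unique(m, n)
    print(f"{n:3d} reads: {unique:5.1f} unique, {n - unique:5.1f} duplicates")
```

In this model the unique count levels off as it approaches `m`, while the duplicate count keeps growing with every additional read.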

Best,

julian

Ali Basuony

May 24, 2021, 1:26:11 PM

to Stacks

Hi Julian,

Thanks so much for your detailed reply. It's very clear now.
I did not mean that Stacks is not good at all. The problem is that we are going to outsource the wet-lab protocol, and I want to make sure the data will be usable.

Again, thanks for your help, patience, and continuous support for all Stacks users.

Ali
