script to purge data set of PCR duplicates

claudius

Aug 12, 2011, 6:50:21 AM

to Stacks

Hi,

does anybody have a script to remove PCR duplicates from RAD data that
I could use?

I would like to purge my RAD data set of PCR duplicates and re-run
stacks with the cleaned data set. I have standard RAD data, so the
paired-end reads can be used to detect PCR duplicates (like the
fragment counts in RADtags).

many thanks for your help,

claudius

Nikoletta-Athanasia Gkatza

Jul 18, 2012, 11:55:46 AM

to stacks...@googlegroups.com

Hi Claudius,

have you heard back from anyone about this? Have you maybe found a way to overcome this issue? Thank you!

Nicole

Claudius Kerth

Jul 25, 2012, 11:48:46 AM

to stacks...@googlegroups.com

Hi Nicoletta,

sorry for my late reply. I am currently not following the stacks forum every day.

As you've probably already seen in the forum, stacks comes with a script called "clone_filter", which purges PCR duplicates from standard RAD data (i.e. with random shearing and paired-end sequencing). The script outputs only in fasta format, i.e. quality scores get lost, but that doesn't matter since stacks doesn't take quality scores into account anyway.
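To illustrate the idea of using the paired-end read to detect PCR duplicates, here is a minimal Python sketch. It is not the code of clone_filter or of any script mentioned in this thread. It assumes that two read pairs are PCR duplicates when both their single-end and their paired-end sequences are identical, and the function names `count_fragments` and `purge_duplicates` are made up for this example.

```python
from collections import Counter


def count_fragments(read_pairs):
    """Count how often each (single-end, paired-end) sequence combination occurs.

    read_pairs: iterable of (se_seq, pe_seq) tuples.
    Assumption: identical SE and PE sequences mean the same sheared fragment,
    i.e. PCR copies of one original molecule.
    """
    return Counter((se_seq, pe_seq) for se_seq, pe_seq in read_pairs)


def purge_duplicates(read_pairs):
    """Return one read pair per distinct fragment, in order of first appearance."""
    # Counter is a dict subclass, so its keys keep first-seen order (Python 3.7+).
    return list(count_fragments(read_pairs))


if __name__ == "__main__":
    pairs = [
        ("TGCAGGAAT", "CCGTA"),  # fragment 1
        ("TGCAGGAAT", "CCGTA"),  # PCR copy of fragment 1
        ("TGCAGGAAT", "GGATC"),  # same RAD site, different shear point
    ]
    for (se_seq, pe_seq), n in count_fragments(pairs).items():
        print(se_seq, pe_seq, n)
```

Exact matching is the simplest possible criterion. A single sequencing error in either read makes a PCR copy look like a new fragment, so this approach undercounts duplicates.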

If you want to keep quality info, then I can send you my script "purge_PCR_duplicates.pl", which I used a year ago in order to purge my RAD data of PCR duplicates. I would have to add the storage of median quality scores for uniques, though.

cheers,

claudius

--
For more options or to unsubscribe: http://groups.google.com/group/stacks-users
Stacks website: http://creskolab.uoregon.edu/stacks/

Nikoletta-Athanasia Gkatza

Jul 26, 2012, 4:54:57 AM

to stacks...@googlegroups.com

Hi Claudius,

no need to apologise. Thank you very much for your informative response! Yes, I have noticed the “clone_filter” script. Would it be possible to send me your script? I would like to keep a record of the quality scores too.

In addition, if you have any guidelines or documentation for it, that would be very much appreciated.

Best wishes,

Nicole

Claudius Kerth

Jul 30, 2012, 9:38:08 AM

to stacks...@googlegroups.com

Hi Nicole,

you can download the script "purge_PCR_duplicates.pl" from:

https://github.com/claudiuskerth/scripts_for_RAD/blob/master/purge_PCR_duplicates.pl

Click on "Raw", then save the page as plain ASCII text. Under Unix, make the file executable with:

$ chmod +x purge_PCR_duplicates.pl

Then run it from the directory where you saved it:

$ ./purge_PCR_duplicates.pl -h

... for more explanation.

I have added the calculation of median quality scores for all single-end reads within a set of PCR duplicates. This should be better than just randomly picking a quality score string from one PCR copy. Currently, the paired-end sequence (and its quality score string) of a unique fragment is simply the first one found by the script. As soon as I have time again, I will add the determination of a consensus sequence for the PE sequences in a set of PCR duplicates.

Also, be aware that each sequencing error in the single-end reads in almost all cases generates a new unique. That means the output will have a higher proportion of reads with sequencing errors than the input. Many sequencing errors can be identified and corrected if they belong to a set of PCR duplicates (i.e. reads from the same restriction fragment). I will add that at a later stage.
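As a rough illustration of the median quality idea (this is not the code of purge_PCR_duplicates.pl), the Python sketch below takes the quality strings of all single-end reads in one set of PCR duplicates and builds a single string from the per-position medians. It assumes Phred+33 encoding and reads of equal length, and it uses the lower median when the number of reads is even. All of these are assumptions for the example, as is the function name `median_quality_string`.

```python
from statistics import median_low


def median_quality_string(quality_strings, offset=33):
    """Collapse several quality strings into one by taking per-position medians.

    quality_strings: list of equal-length FASTQ quality strings.
    offset: ASCII offset of the encoding (33 is assumed here, i.e. Phred+33).
    """
    if len({len(q) for q in quality_strings}) != 1:
        raise ValueError("need at least one quality string, all of equal length")

    medians = []
    # zip(*...) walks through the reads column by column (position by position).
    for column in zip(*quality_strings):
        scores = [ord(char) - offset for char in column]
        # median_low always returns one of the observed integer scores.
        medians.append(chr(median_low(scores) + offset))
    return "".join(medians)


if __name__ == "__main__":
    duplicates = ["III", "I5I", "555"]  # 'I' = Q40, '5' = Q20 under Phred+33
    print(median_quality_string(duplicates))
```

The median is less sensitive than the mean to a single PCR copy with unusually low quality at one position, which is one reason to prefer it over averaging.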

If you have any questions or need assistance with running the script, don't hesitate to ask me,

claudius
