Using -D option

Tejaswi Yarra

Feb 25, 2015, 11:58:05 AM

to ea-u...@googlegroups.com

Hello,

I have a question about removing duplicates from the sequences using the -D option.
I am unsure as to what the number N represents.

The documentation indicates:
-D N Remove duplicate reads : Read_1 has an identical N bases (0)

If I do "-D 50" then Read_1 has 50 identical bases? What does this actually mean?
How can I actually use this option to remove duplicate sequences from my data?

Thank you in advance for your help, and apologies if my question is very amateurish.

Regards,
Teja.

Jason Powers

Apr 2, 2015, 11:16:30 AM

to ea-u...@googlegroups.com

Teja,

The N here refers to the number of bases examined for duplication.

Let's say your read is a single-end fastq that is 50 nt long.

If you set N to 25, it looks only at the first 25 bases of each read (the 25-mer starting from the first base), finds all the unique 25-mers, and tosses any read whose 25-mer duplicates another read's.
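
To make the idea concrete, here is a minimal Python sketch of prefix-based duplicate removal. It is only an illustration of the behaviour described above, not the tool's actual implementation. The function name `dedup_by_prefix`, the toy reads, and the choice to keep the first read seen for each prefix are all assumptions made for the example.

```python
def dedup_by_prefix(reads, n):
    """Keep one read per distinct n-base prefix.

    reads: iterable of (name, sequence) pairs.
    n:     number of leading bases compared (the role played by N in -D N).

    Assumption for illustration: the first read seen with a given
    prefix is the one that is kept; the real tool may choose differently.
    """
    seen = set()   # prefixes already encountered
    kept = []      # reads that survive deduplication
    for name, seq in reads:
        key = seq[:n]          # only the first n bases are examined
        if key in seen:
            continue           # same prefix as an earlier read: drop it
        seen.add(key)
        kept.append((name, seq))
    return kept


# Toy data (hypothetical): r1 and r2 share their first 8 bases
# but differ in the last 2.
reads = [
    ("r1", "ACGTACGTAA"),
    ("r2", "ACGTACGTCC"),
    ("r3", "TTGTACGTAA"),
]

print(dedup_by_prefix(reads, 8))   # r2 is dropped as a duplicate of r1
print(dedup_by_prefix(reads, 10))  # all three reads differ over 10 bases
```

With a smaller N, more reads count as duplicates because fewer bases have to match; with N equal to the full read length, only reads that are identical over the whole read are removed.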

Tejaswi Yarra

Jun 5, 2015, 11:38:46 AM

to ea-u...@googlegroups.com

Thank you very much for the explanation, Jason!
Very sorry for the late reply; I had not checked back here in a long time.
