--adapter_mm and process_radtags


Debbie

Jun 3, 2014, 10:58:33 AM
to stacks...@googlegroups.com
Hi All,

I am trying to use process_radtags to demultiplex and clean my data and to remove adapters.

Are there any rules of thumb for choosing the number of mismatches to allow with --adapter_mm? The example in the Stacks manual uses 2, but I wasn't sure why, or whether that value should change with the length of the adapter. Currently I am trying to clip the Illumina TruSeq Universal Adapter from read 2 and a TruSeq Index Adapter from read 1. I know they are there from a FastQC analysis. My data is paired-end RAD.

Any help or a point in the right direction would be greatly appreciated. 

Thanks for your help in advance, 

Debbie
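
To make the question concrete, here is a minimal Python sketch of what an allowed-mismatch threshold means when scanning a read for an adapter. It is an illustration only and is not claimed to be how process_radtags searches for adapters internally. The function names (`count_mismatches`, `find_adapter`) and the short toy sequences are assumptions made up for this example, and only full-length adapter matches are considered.

```python
def count_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def find_adapter(read: str, adapter: str, max_mm: int) -> int:
    """Return the first offset where the adapter aligns to the read with at
    most max_mm mismatches, or -1 if there is no such offset.
    Simplification: partial adapter matches at the read end are ignored."""
    n = len(adapter)
    for i in range(len(read) - n + 1):
        if count_mismatches(read[i:i + n], adapter) <= max_mm:
            return i
    return -1


# Toy data (hypothetical): a 13 bp adapter and a read that contains a copy
# of it with a single substitution, starting at offset 7.
adapter = "AGATCGGAAGAGC"
read = "TTGACCA" + "AGATCGGTAGAGC" + "GGT"

print(find_adapter(read, adapter, max_mm=0))  # -1: no exact copy in the read
print(find_adapter(read, adapter, max_mm=2))  # 7: found once mismatches are allowed
```

A stricter threshold misses adapter copies that carry sequencing errors, while a looser one raises the chance that genuine genomic sequence is mistaken for adapter, which is why the threshold is usually considered relative to the adapter length.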

Julian Catchen

Jun 4, 2014, 8:32:07 PM
to stacks...@googlegroups.com, debbie...@googlemail.com
Hi Debbie,

My first question would be: what is the insert length of your RAD
library? Do you expect a lot of overlap between your reads, or no
overlap at all? Second, I wouldn't set the mismatch parameter too high.
Each 'mismatch' corresponds to a sequencing error in the read, and I
wouldn't expect more than a small number of sequencing errors across the
whole read, let alone just in the part of the read covered by adapter
sequence.

Also, you may want to consider trimming a few bases off the ends of all
of your reads if you don't expect a lot of adapter. Whenever
process_radtags finds adapter it discards the read, so if you have only a
small amount of adapter you can keep a lot more reads by trimming
everything by, say, 5 bp.

julian
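
The trimming suggestion above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a recommended pipeline: it assumes uncompressed, strictly 4-line FASTQ records, and the helper names (`read_fastq`, `trim_3prime`) and the toy read are invented for the example.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

# header, sequence, separator line, quality string
FastqRecord = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an open text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def trim_3prime(records: Iterable[FastqRecord], n_trim: int) -> Iterator[FastqRecord]:
    """Remove n_trim bases, and their quality scores, from the 3' end of each read."""
    for header, seq, plus, qual in records:
        keep = max(len(seq) - n_trim, 0)
        yield header, seq[:keep], plus, qual[:keep]


# Toy input (hypothetical): one made-up 20 bp read.
toy_fastq = io.StringIO("@read1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n")
trimmed = list(trim_3prime(read_fastq(toy_fastq), n_trim=5))
for header, seq, plus, qual in trimmed:
    print(header, seq, plus, qual, sep="\n")
```

The sequence and quality strings are cut to the same length so the records stay valid FASTQ. For paired-end data both files would need the same treatment so that read pairs stay in sync.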

Debbie

Jun 5, 2014, 5:23:02 AM
to stacks...@googlegroups.com, debbie...@googlemail.com, jcat...@uoregon.edu
Hi Julian, 

Thanks so much for your fast response! 

I believe the samples were size-selected to between 300 and 500 bp, as the postdoc who prepped the libraries used the Etter protocol. Based on that, I think there should be little overlap between the two reads, as we sequenced on a HiSeq.

Adapters don't seem to be present at a very high frequency, but there are enough for FastQC to pick them up before I run process_radtags. After I run process_radtags they are no longer detected, but I'm unsure how stringent to make the detection, as there are still some overrepresented sequences. I have tried 2 mismatches and 6 mismatches. Six mismatches is 10% of the adapter sequence, a threshold I have seen used before, but I think it is quite high.

Thanks for the tip about trimming away the last 5bp! 

All the best, 
Debbie 
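
One way to put numbers on the 2-versus-6 comparison is a simple binomial model of sequencing errors. The sketch below is a back-of-the-envelope illustration only: the 60 bp adapter length is implied by "6 mismatches equals 10%", but the 1% per-base error rate is an assumed round figure that does not come from this thread, and errors are treated as independent substitutions over a full-length adapter copy. The function name `prob_at_least` is made up for the example.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more
    sequencing errors among n bases, each wrong with probability p."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


adapter_len = 60    # implied by "6 mismatches equals 10%"
error_rate = 0.01   # ASSUMPTION: illustrative per-base error rate

print(f"expected errors in the adapter: {adapter_len * error_rate:.2f}")
for max_mm in (2, 6):
    # A true adapter copy is missed only if it carries more than max_mm errors.
    missed = prob_at_least(max_mm + 1, adapter_len, error_rate)
    print(f"--adapter_mm {max_mm}: {max_mm / adapter_len:.1%} of adapter length, "
          f"P(more than {max_mm} errors) = {missed:.2e}")
```

Under these assumptions only about 2% of full-length adapter copies would carry more than 2 errors, so raising the threshold to 6 would recover very few additional adapter reads. With a different error rate, or with adapter that only partly overlaps the read end, the numbers change accordingly.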