process_radtags - high percentage of RAD cutsite not found drops

708 views
Skip to first unread message

Paula Reyes

unread,
Mar 5, 2019, 8:21:09 AM3/5/19
to Stacks
Hello,
I have data paired-end and double digest (apeKI and pstI). I run process_radtags with 384 samples and I found that the average percentage of RAD cutsite not found drops is 70%. I think the process_radtags discards reads that do not have the RAD cutsite at both ends. I followed this thread (https://groups.google.com/forum/#!topic/stacks-users/Sa8X4aArDHA) and I will ask for the raw data, since I have demultiplexed reads.


Is there any way to keep the reads that do not have the RAD cutsite at both ends (other than running process_radtags with the single enzime option). I believe it could be possible to have sequences that are longer than 150bp the read length sequenced, the sequences are cut and for this reason the RAD cutsite is not found and the sequences is discarded.


Best Regards
Paula
Message has been deleted

Catchen, Julian

unread,
Mar 7, 2019, 4:14:48 PM3/7/19
to stacks...@googlegroups.com, Juncong Yan
Dear John,

The process_radtags program works fine on double-digest RAD data. Data
from the GBS protocol is not double-digest RAD data.

Your comment exactly reverses the problem you are encountering. The
process_radtags program is very simple with respect to cutsites. It
looks at the sequence on the single- and paired-end reads, and looks for
the remainder cut site sequence that should be present. If it doesn't
find the expected sequence, it tells you so. If it finds the sequence
with a sequencing error as well, it will correct the sequencing error
(with the -r flag).

You say that when you disable this check, 99.8% of your reads pass "QC",
but you have now disabled QC effectively, so your statement is
meaningless. Well, the phred scores of your sequenced nucleotides are
high enough to avoid having the reads discarded, so that is good.

The question you should ask yourself is not, "Why does process_radtags
not work very well?" And instead ask the question, "Why do my reads not
have the cut sites I expect them to have, based on the molecular
protocol that was used?"

The latter question is more meaningful with respect to your biological
data analysis.

julian

Juncong Yan wrote on 3/7/19 2:52 PM:
> Hi Paula
> I am doing the same job currently and meet the same issue you have. It
> looks process_radtags does not work very well on ddRAD. the code "
> process_radtags -P -p ./Rawdata/ -o ./taged_data/ -b
> ./barcodes/barcode_1 -c -q -r --index_null --renz_1 mspI --renz_2
> nlaIII" I used lead to 87.4% of RAD cutsite not found drop. The way I
> found to solve this issue is using " process_radtags -P -p ./Rawdata/ -o
> ./taged_data/ -b ./barcodes/barcode_1 -c -q -r --index_null --renz_1
> mspI --renz_2 nlaIII --disable_rad_check". after that 99.8% reads passed QC.
> John
>
>
> 在 2019年3月6日星期三 UTC+13上午2:21:09,Paula Reyes写道:
>
> Hello,
> I have data paired-end and double digest (apeKI and pstI). I run
> process_radtags with 384 samples and I found that the average
> percentage of RAD cutsite not found drops is 70%. I think the
> process_radtags discards reads that do not have the RAD cutsite at
> both ends. I followed this thread
> (https://groups.google.com/forum/#!topic/stacks-users/Sa8X4aArDHA
> <https://groups.google.com/forum/#!topic/stacks-users/Sa8X4aArDHA>)
Message has been deleted

Catchen, Julian

unread,
Mar 8, 2019, 9:47:12 AM3/8/19
to stacks...@googlegroups.com, Juncong Yan
That's not correct, the ddRAD protocol (and process_radtags) does not
work that way. The cutsite will be the first sequence (after an optional
inline barcode) on both the single- and paired-end, regardless of insert
size. See Peterson 2012 (or contact the individuals who constructed your
libraries).

Have you asked the sequencing facility if they spiked other DNA into
your lane. Often PhiX is added to a RAD sequencing lane to increase
sequence diversity, which must be removed informatically, or you may
have a bunch of unrelated, but sequencable DNA, in your library.

Juncong Yan wrote on 3/7/19 5:29 PM:
> If I guess is correct, the problem  is the program “ process_radtags
> —renz_1 —renz_2" tries to find the sequences/reads which have cutsites
> of two RE both which means the sequence between two cutsites must less
> than 150bp due to the Illumina seq tech. However, this kind of sequence
> is rare. Normally the gap between two RE is far more than 150 bp so that
> only one RE’s cutsite will be kept in sequencing data.
>
> I tried to run program process_radtags -e for each RE separately and got
> 60.7% passed QC  for RE1 while 48.2% for RE2 and 12.6%for RE1and RE2
> both. So it become a statistic question, totally, 96.3% has at least one
> cutsites.在 2019年3月6日星期三 UTC+13上午2:21:09,Paula Reyes写道:
>
Reply all
Reply to author
Forward
0 new messages