About process_radtags barcode rescure

939 views
Skip to first unread message

高驥

unread,
Mar 2, 2015, 9:34:04 AM3/2/15
to stacks...@googlegroups.com
Dear Julian and Stacks Users,

I have questions about barcode_dist and it rescure  algorithm.

Our ddRAD Data have one site designed 8 12bp barcodes, and each other's edit distance is 5.

Barcode A    AGTTGTGGACCT
Barcode B    CCCACTTGAAAT
Barcode C    CTCATGGTCTAT
Barcode D    GCAAACACGTTT
Barcode E    GCACGCTAAGCA
Barcode F    GTTGTACCTCGG
Barcode G    TATCACTCCGGG
Barcode H    TGGGAACAGGGA


And the other site is illumina Index like Peterson et al. (2012), Index 05 Sequencing Quality QC isn't passed,so it discarded.

01 CGATGT
02 TTAGGC
03 TGACCA
04 ACAGTG
06 TAGCTT
07 GGCTAC
08 CTTGTA
09 CCGTCC
10 GTGAAA
11 GTTTCG
12 ATTCCT

I try a small data set to test demultiplex method. A data set had Raw data first 10000 Paired-End Reads, contained 11 Index. Each Index had 20000 Reads.

And used Stacks process_radtags v1.27 compared with  Costea et al. (2013) - TagGD findIndex command which used edit ditstance to demultiplex.

But TagGD just find barcode and didn't filter bad quaility and restriction site.

Stacks process_radtags is more poweful and efficiency.

But when I used small dataset to find result different in Stacks and TagGD

I turn off quality filter and rad_check just rescure barcode_dist < 5.

I used the command in Stacks version 1.27 like this:

process_radtags -P -p ../ -o ./No_RAD_No_Quality_Filter -r -D -b ./Barcode.list.txt --inline_index --renz_1 pstI --renz_2 mspI --barcode_dist 5 --disable_rad_check


The log.file like attachment.

Partial like this:

File Retained Reads Low Quality Ambiguous Barcodes Ambiguous RAD-Tag Total
Test_Spicy_pool_10_0317_P10_GTGAAA_R1_10.fastq 19888 0 112 0 20000
Test_Spicy_pool_11_0317_P11_GTTTCG_R1_11.fastq 19890 0 110 0 20000
Test_Spicy_pool_12_0317_P12_ATTCCT_R1_12.fastq 19820 0 180 0 20000
Test_Spicy_pool_1_0317_P1_CGATGT_R1_01.fastq 19872 0 128 0 20000
Test_Spicy_pool_2_0317_P2_TTAGGC_R1_02.fastq 19864 0 136 0 20000







Total Sequences 220000
Ambiguous Barcodes 1588
Low Quality 0
Ambiguous RAD-Tag 0
Retained Reads 218412
Barcode Filename Total No RadTag Low Quality Retained
AGTTGTGGACCT-CGATGT H_F2_008 2678 0 0 2678
CCCACTTGAAAT-CGATGT H_F2_011 1906 0 0 1906
CTCATGGTCTAT-CGATGT B_F2_014 434 0 0 434
GCAAACACGTTT-CGATGT H_F2_015 1954 0 0 1954
GCACGCTAAGCA-CGATGT B_F2_016 3314 0 0 3314
GTTGTACCTCGG-CGATGT H_F2_017 3294 0 0 3294
TATCACTCCGGG-CGATGT H_F2_018 2616 0 0 2616
TGGGAACAGGGA-CGATGT H_F2_019 2098 0 0 2098

The results looks great.

But when I  summary two methods result. I find very strange things in Index.

Summary like this:
















































I don't known why Index 01 reads only had 18294 reads, it should have 19872 reads.

And I check Test data Index. use this command to find Index 01's Index

 cat Test_Spicy_pool_1_0317_P1_CGATGT_R1_01.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r

 9189   CGATGT
   620   CGATGC
   114   CGATCT
     14   CTATGT
     11   AGATGT
      9   CGATAT
      7   CGATGA
      6   CGTTGT
      6   CGATGG
      5   CGATTT
      5   CGACGT
      3   TGATGT
      3   CGAAGT
      3   CAATGT
      2   CGAGGT
      2   CCATGT
      1   CGGTGT

It has perfect match CGATGT 9189 and dist 1 nt CGATGC 620  CGATCT 114 CTATGT  14 AGATGT 11...

First two (9189 + 620) X 2 = 19618 > 18294.

I don't know what happen in this condition.

I try different barcode_dist like 2 (default) , the result is less than 18294.

The other hand, Index 04 08 09 12 > 20000 reads. It didn't make sense. (One Index just has 20000 reads)

The  Barcode_list is in attachments.

Small dataset download (17Mb)


Anyone have thoughts ?

Thanks.

Sincerely,

             Kao Chi














process_radtags.log
Barcode.list.txt

Julian Catchen

unread,
Mar 2, 2015, 1:21:03 PM3/2/15
to stacks...@googlegroups.com, electr...@gmail.com
Hi Kao Chi,

Thanks very much for your detailed analysis. I would like to get to the
bottom of any discrepancies. Before I do, however, I just want to verify
one thing.

When you specify the barcode distance to process_radtags, it is not the
edit distance, the edit distance is one less than what is specified with
--barcode_dist.

Specifying the barcode distance is a very old option for the program and
it is meant to be interpreted as the distance between barcodes. Given a
nucleotide distance of five between barcodes, we can correct anything
with up to four errors without risking converting one barcode into
another barcode.

So, to compare exactly against the other program you are using, you
should specify --barcode_dist 6 to get an edit distance of 5.

Perhaps we should change this to be strictly the edit distance in the
code base.

Anyway, if you could update your comparison this way, I will look into
the discrepancies that remain.

Best,

julian


高驥 wrote:
> Dear Julian and Stacks Users,
>
> I have questions about barcode_dist and it rescure algorithm.
>
> Our ddRAD Data have one site designed 8 12bp barcodes, and each other's
> edit distance is 5.
>
> Barcode AAGTTGTGGACCT
> Barcode BCCCACTTGAAAT
> Barcode CCTCATGGTCTAT
> Barcode DGCAAACACGTTT
> Barcode EGCACGCTAAGCA
> Barcode FGTTGTACCTCGG
> Barcode GTATCACTCCGGG
> Barcode HTGGGAACAGGGA
>
>
> And the other site is illumina Index like Peterson/et al./ (2012), Index
> 05 Sequencing Quality QC isn't passed,so it discarded.
>
> 01 CGATGT
> 02 TTAGGC
> 03 TGACCA
> 04 ACAGTG
> 06 TAGCTT
> 07 GGCTAC
> 08 CTTGTA
> 09 CCGTCC
> 10 GTGAAA
> 11 GTTTCG
> 12 ATTCCT
>
> I try a small data set to test demultiplex method. A data set had Raw
> data first 10000 Paired-End Reads, contained 11 Index. Each Index had
> 20000 Reads.
>
> And used Stacks process_radtags v1.27 compared with Costea/et al./
> <https://lh3.googleusercontent.com/-HoFkt5bG-bA/VPRuB2O3b2I/AAAAAAAAADo/vxCU5RxZYj8/s1600/2015-03-02_220610.png>

高驥

unread,
Mar 2, 2015, 10:37:11 PM3/2/15
to stacks...@googlegroups.com, electr...@gmail.com, jcat...@illinois.edu
Hi Julian,

I specify barcode_dist 6 to run the program, the partial log file like this:

process_radtags -P -p ../ -o ./No_RAD_No_Quality_Filter -r -D -b ./Barcode.list.txt --inline_index --renz_1 pstI --renz_2 mspI --barcode_dist 6 --disable_rad_check
process_radtags version 1.27 executed 2015-03-03 10:08:03
File Retained Reads Low Quality Ambiguous Barcodes Ambiguous RAD-Tag Total
Test_Spicy_pool_10_0317_P10_GTGAAA_R1_10.fastq 19908 0 92 0 20000
Test_Spicy_pool_11_0317_P11_GTTTCG_R1_11.fastq 19904 0 96 0 20000
Test_Spicy_pool_12_0317_P12_ATTCCT_R1_12.fastq 19842 0 158 0 20000
Test_Spicy_pool_1_0317_P1_CGATGT_R1_01.fastq
19888 0 112 0 20000
Test_Spicy_pool_2_0317_P2_TTAGGC_R1_02.fastq 19874 0 126 0 20000
Test_Spicy_pool_3_0317_P3_TGACCA_R1_03.fastq
19888 0 112 0 20000
Test_Spicy_pool_4_0317_P4_ACAGTG_R1_04.fastq 19884 0 116 0 20000
Test_Spicy_pool_6_0317_P6_TAGCTT_R1_06.fastq 19876 0 124 0 20000
Test_Spicy_pool_7_0317_P7_GGCTAC_R1_07.fastq
19846 0 154 0 20000
Test_Spicy_pool_8_0317_P8_CTTGTA_R1_08.fastq 19880 0 120 0 20000
Test_Spicy_pool_9_0317_P9_CCGTCC_R1_09.fastq 19826 0 174 0 20000
Total Sequences 220000
Ambiguous Barcodes 1384
Low Quality 0
Ambiguous RAD-Tag 0
Retained Reads 218616
Barcode Filename Total No RadTag Low Quality Retained
AGTTGTGGACCT-CGATGT H_F2_008 2680 0 0 2680
CCCACTTGAAAT-CGATGT H_F2_011 1902 0 0 1902
CTCATGGTCTAT-CGATGT B_F2_014 432 0 0 432
GCAAACACGTTT-CGATGT H_F2_015 1960 0 0 1960
GCACGCTAAGCA-CGATGT B_F2_016 3314 0 0 3314
GTTGTACCTCGG-CGATGT H_F2_017 3292 0 0 3292
TATCACTCCGGG-CGATGT H_F2_018 2614 0 0 2614
TGGGAACAGGGA-CGATGT H_F2_019 2098 0 0 2098

And then, summary like this table:



Index 01 reads number still looks strange.

Then, Index 04 reads number is 22896 , it's too many reads in this index.

I check other Test data Index. use  command to find Index readcounts.

 cat Test_Spicy_pool_4_0317_P4_ACAGTG_R1_04.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r                

10000 ACAGTG

The other indexs,

 cat Test_Spicy_pool_4_0317_P4_ACAGTG_R1_04.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r                
  10000 ACAGTG

 cat Test_Spicy_pool_7_0317_P7_GGCTAC_R1_07.fastq  | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r                
  10000 GGCTAC

.....

 cat Test_Spicy_pool_11_0317_P11_GTTTCG_R1_11.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r               
  10000 GTTTCG

 cat Test_Spicy_pool_12_0317_P12_ATTCCT_R1_12.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r               
   9795 ATTCCT
     40 CTTCCT
     27 ATTCCA
     20 ATTCNT
     12 ATTTCT
     12 ATTCCN
     12 ATTCAT
     11 ATTACT
      9 GTTCCT
      9 ATCCCT
      8 ACTCCT
      7 ATACCT
      7 AATCCT
      6 ATTCCC
      5 TTTCCT
      5 ATTCTT
      5 AGTCCT
      4 ATGCCT
      3 ATTCCG
      2 ATTGCT
      1 ATTCGT


It seems not problem in Index.

Please find  the discrepancies between TagGD and Stacks process_radtags

And tell me why Index reads will demultiplex like the result.



Hope edit distance option can be used in stacks.

It's a good tool to tread ddRAD or other GBS Data.

Thank you very much !!

Sincerely,

             Kao Chi








Julian Catchen於 2015年3月3日星期二 UTC+8上午2時21分03秒寫道:
process_radtags.log

Julian Catchen

unread,
Mar 4, 2015, 12:43:33 PM3/4/15
to stacks...@googlegroups.com, electr...@gmail.com
Hi Kao Chi,

Again, thanks very much for the detailed test case. This makes it much easier to tackle your question.

I narrowed down your test case to just look at the single-end data from one of your 11 files. Given the 11, 12bp barcodes you supplied, I found 27 cases out of the 10,000 reads where process_radtags could not recover a barcode but the TagGD program could. It turns out that TagGD is doing a full sequence alignment on each barcode so it has the ability to detect insertions and deletions in the barcode, while process_radtags only looks at mismatches in the barcode.

These 27 cases appear to have various indels in the barcode sequence. Here are a few examples:

(true barcode) GCACGCTAAGCA
                    *
(illumina seq) GCACC-TAAGCAT


(true barcode)  AGTTGTGGACCT
                 *
(illumina seq)  A-TTGTGGACCT


(true barcode)  GTTGTACCTCGG
                   *
(illumina seq)  GTT-TACCTCGGT


(true barcode)  AGTTGTGGACCT
                  *
(illumina seq)  AG-TGTGGACCTT


(true barcode)  G-TTGTACCTCGG
                 *
(illumina seq)  GTTTGTACCTCG
 


(true barcode)  GCAAACACGTTT
                **    **
(illumina seq)  --AAAC--GTTTGGCT


The first four look nice and simple, a single deletion has occurred in the Illumina read. This causes a frame shift in the sequence and process_radtags counts everything downstream of the deletion as a mismatch while TagGD is able to insert a gap and properly align the rest of the barcode.

The last two cases are much more problematic. In the penultimate example, there is an insertion into the barcode, and in the last example, there are multiple possible deletions.

>From a code point of view, doing the full dynamic programming alignment on the barcode is more intensive. We could potentially add more threads to the program to do this, but I'm not sure it is worth the computational cost. More importantly, adding gapped alignments allows us to significantly transform the barcode as is demonstrated in the last example. I'm not sure I trust the outcome of gapped alignments to assign my sequence to the proper source. Further, if an indel occurs on the Illumina flow cell, what does this mean? One of the problems with Illumina is the propensity of sequences in a cluster of sequences on the flow cell to come out of phase with one another far into the sequencing process. If we see an indel in the first few bases of the sequence does this imply that the sequence is already out of phase? If so, it is likely not trustworthy.

I'm definitely happy to hear other people's opinions on this topic.

Best,

julian

March 2, 2015 at 9:37 PM
March 2, 2015 at 12:20 PM
Hi Kao Chi,

Thanks very much for your detailed analysis. I would like to get to the bottom of any discrepancies. Before I do, however, I just want to verify one thing.

When you specify the barcode distance to process_radtags, it is not the edit distance, the edit distance is one less than what is specified with --barcode_dist.

Specifying the barcode distance is a very old option for the program and it is meant to be interpreted as the distance between barcodes. Given a nucleotide distance of five between barcodes, we can correct anything with up to four errors without risking converting one barcode into another barcode.

So, to compare exactly against the other program you are using, you should specify --barcode_dist 6 to get an edit distance of 5.

Perhaps we should change this to be strictly the edit distance in the code base.

Anyway, if you could update your comparison this way, I will look into the discrepancies that remain.

Best,

julian




March 2, 2015 at 8:34 AM
Dear Julian and Stacks Users,

I have questions about barcode_dist and it rescure  algorithm.

Our ddRAD Data have one site designed 8 12bp barcodes, and each other's edit distance is 5.

Barcode A    AGTTGTGGACCT
Barcode B    CCCACTTGAAAT
Barcode C    CTCATGGTCTAT
Barcode D    GCAAACACGTTT
Barcode E    GCACGCTAAGCA
Barcode F    GTTGTACCTCGG
Barcode G    TATCACTCCGGG
Barcode H    TGGGAACAGGGA


And the other site is illumina Index like Peterson et al. (2012), Index 05 Sequencing Quality QC isn't passed,so it discarded.

01 CGATGT
02 TTAGGC
03 TGACCA
04 ACAGTG
06 TAGCTT
07 GGCTAC
08 CTTGTA
09 CCGTCC
10 GTGAAA
11 GTTTCG
12 ATTCCT

I try a small data set to test demultiplex method. A data set had Raw data first 10000 Paired-End Reads, contained 11 Index. Each Index had 20000 Reads.

And used Stacks process_radtags v1.27 compared with  Costea et al. (2013) - TagGD findIndex command which used edit ditstance to demultiplex.

高驥

unread,
Mar 5, 2015, 1:59:49 AM3/5/15
to stacks...@googlegroups.com, electr...@gmail.com, jcat...@illinois.edu
Hi Julian,

Thank you for your test and to find the difference in two program. 
 
I have another question for Index demultiplex in Index01 and index04

It still make me confused.

Do you recommand only put 1 Index file and run 11 times for each Index when I use process_radtags?

I don't know why Index04 reads (22896) are more than input reads (20000).

That's probably get wrong sources.


Thank you very much.

Sincerely,

             Kao Chi



Julian Catchen於 2015年3月5日星期四 UTC+8上午1時43分33秒寫道:

Julian Catchen

unread,
Mar 5, 2015, 11:50:04 AM3/5/15
to 高驥, stacks...@googlegroups.com
Hi Kao Chi,

This is happening because the number of allowed mismatches you are
specifying is 5. That's fine for the single-end barcode (which is 12bp
long), but it is a problem for the index barcode (which is only 6bp).
All of the "corrected" index barcodes are being placed into the first
correctable barcode that is found, which is ACAGTG.

In the new version, I have added two barcode mismatch parameters, one
for each barcode specified. When you instead use a mismatch of 5bp for
the inline barcode and 2bp for the index barcode, you get the expected
number, < 20,000.

I'll send you a link to the new code to try out. This change will be in
the next release.

Best,

julian
>> 高驥 <javascript:>
>> <https://lh3.googleusercontent.com/-GkgWhBr200w/VPUp8dS0LyI/AAAAAAAAAEA/r8IXgmN40DM/s1600/2015-03-03_112805.png>
>> Julian Catchen <javascript:>
>> March 2, 2015 at 12:20 PM
>> Hi Kao Chi,
>>
>> Thanks very much for your detailed analysis. I would like to get
>> to the bottom of any discrepancies. Before I do, however, I just
>> want to verify one thing.
>>
>> When you specify the barcode distance to process_radtags, it is
>> not the edit distance, the edit distance is one less than what is
>> specified with --barcode_dist.
>>
>> Specifying the barcode distance is a very old option for the
>> program and it is meant to be interpreted as the distance between
>> barcodes. Given a nucleotide distance of five between barcodes, we
>> can correct anything with up to four errors without risking
>> converting one barcode into another barcode.
>>
>> So, to compare exactly against the other program you are using,
>> you should specify --barcode_dist 6 to get an edit distance of 5.
>>
>> Perhaps we should change this to be strictly the edit distance in
>> the code base.
>>
>> Anyway, if you could update your comparison this way, I will look
>> into the discrepancies that remain.
>>
>> Best,
>>
>> julian
>>
>>
>>
>>
>> 高驥 <javascript:>
>> March 2, 2015 at 8:34 AM
>> Dear Julian and Stacks Users,
>>
>> I have questions about barcode_dist and it rescure algorithm.
>>
>> Our ddRAD Data have one site designed 8 12bp barcodes, and each
>> other's edit distance is 5.
>>
>> Barcode AAGTTGTGGACCT
>> Barcode BCCCACTTGAAAT
>> Barcode CCTCATGGTCTAT
>> Barcode DGCAAACACGTTT
>> Barcode E GCACGCTAAGCA
>> Barcode FGTTGTACCTCGG
>> Barcode G TATCACTCCGGG
>> Barcode HTGGGAACAGGGA
>>
>>
>> And the other site is illumina Index like Peterson/et al./ (2012),
>> Index 05 Sequencing Quality QC isn't passed,so it discarded.
>>
>> 01 CGATGT
>> 02 TTAGGC
>> 03 TGACCA
>> 04 ACAGTG
>> 06 TAGCTT
>> 07 GGCTAC
>> 08 CTTGTA
>> 09 CCGTCC
>> 10 GTGAAA
>> 11 GTTTCG
>> 12 ATTCCT
>>
>> I try a small data set to test demultiplex method. A data set had
>> Raw data first 10000 Paired-End Reads, contained 11 Index. Each
>> Index had 20000 Reads.
>>
>> And used Stacks process_radtags v1.27 compared with Costea/et al./
>> <https://lh3.googleusercontent.com/-HoFkt5bG-bA/VPRuB2O3b2I/AAAAAAAAADo/vxCU5RxZYj8/s1600/2015-03-02_220610.png>
Reply all
Reply to author
Forward
0 new messages