About process_radtags barcode rescure

高驥

unread,

Mar 2, 2015, 9:34:04 AM3/2/15

to stacks...@googlegroups.com

Dear Julian and Stacks Users,

I have questions about barcode_dist and it rescure algorithm.

Our ddRAD Data have one site designed 8 12bp barcodes, and each other's edit distance is 5.

Barcode A AGTTGTGGACCT

Barcode B CCCACTTGAAAT

Barcode C CTCATGGTCTAT

Barcode D GCAAACACGTTT

Barcode E GCACGCTAAGCA

Barcode F GTTGTACCTCGG

Barcode G TATCACTCCGGG

Barcode H TGGGAACAGGGA

And the other site is illumina Index like Peterson et al. (2012), Index 05 Sequencing Quality QC isn't passed,so it discarded.

01 CGATGT

02 TTAGGC

03 TGACCA

04 ACAGTG

06 TAGCTT

07 GGCTAC

08 CTTGTA

09 CCGTCC

10 GTGAAA

11 GTTTCG

12 ATTCCT

I try a small data set to test demultiplex method. A data set had Raw data first 10000 Paired-End Reads, contained 11 Index. Each Index had 20000 Reads.

And used Stacks process_radtags v1.27 compared with Costea et al. (2013) - TagGD findIndex command which used edit ditstance to demultiplex.

But TagGD just find barcode and didn't filter bad quaility and restriction site.

Stacks process_radtags is more poweful and efficiency.

But when I used small dataset to find result different in Stacks and TagGD

I turn off quality filter and rad_check just rescure barcode_dist < 5.

I used the command in Stacks version 1.27 like this:

process_radtags -P -p ../ -o ./No_RAD_No_Quality_Filter -r -D -b ./Barcode.list.txt --inline_index --renz_1 pstI --renz_2 mspI --barcode_dist 5 --disable_rad_check

The log.file like attachment.

Partial like this:

File	Retained Reads	Low Quality	Ambiguous Barcodes	Ambiguous RAD-Tag	Total
Test_Spicy_pool_10_0317_P10_GTGAAA_R1_10.fastq	19888	0	112	0	20000
Test_Spicy_pool_11_0317_P11_GTTTCG_R1_11.fastq	19890	0	110	0	20000
Test_Spicy_pool_12_0317_P12_ATTCCT_R1_12.fastq	19820	0	180	0	20000
Test_Spicy_pool_1_0317_P1_CGATGT_R1_01.fastq	19872	0	128	0	20000
Test_Spicy_pool_2_0317_P2_TTAGGC_R1_02.fastq	19864	0	136	0	20000

Total Sequences	220000
Ambiguous Barcodes	1588
Low Quality	0
Ambiguous RAD-Tag	0
Retained Reads	218412

Barcode	Filename	Total	No RadTag	Low Quality	Retained
AGTTGTGGACCT-CGATGT	H_F2_008	2678	0	0	2678
CCCACTTGAAAT-CGATGT	H_F2_011	1906	0	0	1906
CTCATGGTCTAT-CGATGT	B_F2_014	434	0	0	434
GCAAACACGTTT-CGATGT	H_F2_015	1954	0	0	1954

GCACGCTAAGCA-CGATGT	B_F2_016	3314	3314
GTTGTACCTCGG-CGATGT	H_F2_017	3294	3294
TATCACTCCGGG-CGATGT	H_F2_018	2616	2616
TGGGAACAGGGA-CGATGT	H_F2_019	2098	2098

The results looks great.

But when I summary two methods result. I find very strange things in Index.

Summary like this:

I don't known why Index 01 reads only had 18294 reads, it should have 19872 reads.

And I check Test data Index. use this command to find Index 01's Index

cat Test_Spicy_pool_1_0317_P1_CGATGT_R1_01.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r

9189 CGATGT

620 CGATGC

114 CGATCT

14 CTATGT

11 AGATGT

9 CGATAT

7 CGATGA

6 CGTTGT

6 CGATGG

5 CGATTT

5 CGACGT

3 TGATGT

3 CGAAGT

3 CAATGT

2 CGAGGT

2 CCATGT

1 CGGTGT

It has perfect match CGATGT 9189 and dist 1 nt CGATGC 620 CGATCT 114 CTATGT 14 AGATGT 11...

First two (9189 + 620) X 2 = 19618 > 18294.

I don't know what happen in this condition.

I try different barcode_dist like 2 (default) , the result is less than 18294.

The other hand, Index 04 08 09 12 > 20000 reads. It didn't make sense. (One Index just has 20000 reads)

The Barcode_list is in attachments.

Small dataset download (17Mb)

https://drive.google.com/file/d/0BwWdkEsPEUCocTZTbXZNX1BmM0k/view?usp=sharing

Anyone have thoughts ?

Thanks.

Sincerely,

Kao Chi

process_radtags.log

Barcode.list.txt

Julian Catchen

unread,

Mar 2, 2015, 1:21:03 PM3/2/15

to stacks...@googlegroups.com, electr...@gmail.com

Hi Kao Chi,

Thanks very much for your detailed analysis. I would like to get to the
bottom of any discrepancies. Before I do, however, I just want to verify
one thing.

When you specify the barcode distance to process_radtags, it is not the
edit distance, the edit distance is one less than what is specified with
--barcode_dist.

Specifying the barcode distance is a very old option for the program and
it is meant to be interpreted as the distance between barcodes. Given a
nucleotide distance of five between barcodes, we can correct anything
with up to four errors without risking converting one barcode into
another barcode.

So, to compare exactly against the other program you are using, you
should specify --barcode_dist 6 to get an edit distance of 5.

Perhaps we should change this to be strictly the edit distance in the
code base.

Anyway, if you could update your comparison this way, I will look into
the discrepancies that remain.

Best,

julian

高驥 wrote:
> Dear Julian and Stacks Users,
>
> I have questions about barcode_dist and it rescure algorithm.
>
> Our ddRAD Data have one site designed 8 12bp barcodes, and each other's
> edit distance is 5.
>
> Barcode AAGTTGTGGACCT
> Barcode BCCCACTTGAAAT
> Barcode CCTCATGGTCTAT
> Barcode DGCAAACACGTTT
> Barcode EGCACGCTAAGCA
> Barcode FGTTGTACCTCGG
> Barcode GTATCACTCCGGG
> Barcode HTGGGAACAGGGA
>
>

> And the other site is illumina Index like Peterson/et al./ (2012), Index

> 05 Sequencing Quality QC isn't passed,so it discarded.
>
> 01 CGATGT
> 02 TTAGGC
> 03 TGACCA
> 04 ACAGTG
> 06 TAGCTT
> 07 GGCTAC
> 08 CTTGTA
> 09 CCGTCC
> 10 GTGAAA
> 11 GTTTCG
> 12 ATTCCT
>
> I try a small data set to test demultiplex method. A data set had Raw
> data first 10000 Paired-End Reads, contained 11 Index. Each Index had
> 20000 Reads.
>

> And used Stacks process_radtags v1.27 compared with Costea/et al./

> <https://lh3.googleusercontent.com/-HoFkt5bG-bA/VPRuB2O3b2I/AAAAAAAAADo/vxCU5RxZYj8/s1600/2015-03-02_220610.png>

高驥

unread,

Mar 2, 2015, 10:37:11 PM3/2/15

to stacks...@googlegroups.com, electr...@gmail.com, jcat...@illinois.edu

Hi Julian,

I specify barcode_dist 6 to run the program, the partial log file like this:

process_radtags -P -p ../ -o ./No_RAD_No_Quality_Filter -r -D -b ./Barcode.list.txt --inline_index --renz_1 pstI --renz_2 mspI --barcode_dist 6 --disable_rad_check
process_radtags version 1.27 executed 2015-03-03 10:08:03


File	Retained Reads	Low Quality	Ambiguous Barcodes	Ambiguous RAD-Tag	Total

Test_Spicy_pool_10_0317_P10_GTGAAA_R1_10.fastq	19908	0	92	0	20000
Test_Spicy_pool_11_0317_P11_GTTTCG_R1_11.fastq	19904	0	96	0	20000
Test_Spicy_pool_12_0317_P12_ATTCCT_R1_12.fastq	19842	0	158	0	20000
Test_Spicy_pool_1_0317_P1_CGATGT_R1_01.fastq

19888

0

112

0

20000

Test_Spicy_pool_2_0317_P2_TTAGGC_R1_02.fastq	19874	0	126	0	20000
Test_Spicy_pool_3_0317_P3_TGACCA_R1_03.fastq

19888

0

112

0

20000

Test_Spicy_pool_4_0317_P4_ACAGTG_R1_04.fastq	19884	0	116	0	20000
Test_Spicy_pool_6_0317_P6_TAGCTT_R1_06.fastq	19876	0	124	0	20000
Test_Spicy_pool_7_0317_P7_GGCTAC_R1_07.fastq

19846

0

154

0

20000

Test_Spicy_pool_8_0317_P8_CTTGTA_R1_08.fastq	19880	0	120	0	20000
Test_Spicy_pool_9_0317_P9_CCGTCC_R1_09.fastq	19826	0	174	0	20000

Total Sequences	220000
Ambiguous Barcodes	1384

Low Quality	0
Ambiguous RAD-Tag	0

Retained Reads

218616


Barcode	Filename	Total	No RadTag	Low Quality	Retained

AGTTGTGGACCT-CGATGT	H_F2_008	2680	2680
CCCACTTGAAAT-CGATGT	H_F2_011	1902	1902
CTCATGGTCTAT-CGATGT	B_F2_014	432	432
GCAAACACGTTT-CGATGT	H_F2_015	1960	1960

GCACGCTAAGCA-CGATGT

B_F2_016

3314

0

3314

GTTGTACCTCGG-CGATGT	H_F2_017	3292	0	0	3292
TATCACTCCGGG-CGATGT	H_F2_018	2614	0	0	2614

TGGGAACAGGGA-CGATGT

H_F2_019

2098

0

2098

And then, summary like this table:

Index 01 reads number still looks strange.

Then, Index 04 reads number is 22896 , it's too many reads in this index.

I check other Test data Index. use command to find Index readcounts.

cat Test_Spicy_pool_4_0317_P4_ACAGTG_R1_04.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r

10000 ACAGTG

The other indexs,

cat Test_Spicy_pool_4_0317_P4_ACAGTG_R1_04.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r

10000 ACAGTG

cat Test_Spicy_pool_7_0317_P7_GGCTAC_R1_07.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r

10000 GGCTAC

.....

cat Test_Spicy_pool_11_0317_P11_GTTTCG_R1_11.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r

10000 GTTTCG

cat Test_Spicy_pool_12_0317_P12_ATTCCT_R1_12.fastq | awk -F: '{if(NR%4==1){print $10}}' | sort | uniq -c | sort -n -r

9795 ATTCCT

40 CTTCCT

27 ATTCCA

20 ATTCNT

12 ATTTCT

12 ATTCCN

12 ATTCAT

11 ATTACT

9 GTTCCT

9 ATCCCT

8 ACTCCT

7 ATACCT

7 AATCCT

6 ATTCCC

5 TTTCCT

5 ATTCTT

5 AGTCCT

4 ATGCCT

3 ATTCCG

2 ATTGCT

1 ATTCGT

It seems not problem in Index.

Please find the discrepancies between TagGD and Stacks process_radtags

And tell me why Index reads will demultiplex like the result.

Hope edit distance option can be used in stacks.

It's a good tool to tread ddRAD or other GBS Data.

Thank you very much !!

Sincerely,

Kao Chi

Julian Catchen於 2015年3月3日星期二 UTC+8上午2時21分03秒寫道：

process_radtags.log

Julian Catchen

unread,

Mar 4, 2015, 12:43:33 PM3/4/15

to stacks...@googlegroups.com, electr...@gmail.com

Hi Kao Chi,

Again, thanks very much for the detailed test case. This makes it much easier to tackle your question.

I narrowed down your test case to just look at the single-end data from one of your 11 files. Given the 11, 12bp barcodes you supplied, I found 27 cases out of the 10,000 reads where process_radtags could not recover a barcode but the TagGD program could. It turns out that TagGD is doing a full sequence alignment on each barcode so it has the ability to detect insertions and deletions in the barcode, while process_radtags only looks at mismatches in the barcode.

These 27 cases appear to have various indels in the barcode sequence. Here are a few examples:

(true barcode) GCACGCTAAGCA
    *
(illumina seq) GCACC-TAAGCAT

(true barcode) AGTTGTGGACCT
   *
(illumina seq) A-TTGTGGACCT

(true barcode) GTTGTACCTCGG
       *
(illumina seq) GTT-TACCTCGGT

(true barcode) AGTTGTGGACCT
*
(illumina seq) AG-TGTGGACCTT

(true barcode) G-TTGTACCTCGG
   *
(illumina seq) GTTTGTACCTCG

(true barcode) GCAAACACGTTT
**    **
(illumina seq) --AAAC--GTTTGGCT

The first four look nice and simple, a single deletion has occurred in the Illumina read. This causes a frame shift in the sequence and process_radtags counts everything downstream of the deletion as a mismatch while TagGD is able to insert a gap and properly align the rest of the barcode.

The last two cases are much more problematic. In the penultimate example, there is an insertion into the barcode, and in the last example, there are multiple possible deletions.

>From a code point of view, doing the full dynamic programming alignment on the barcode is more intensive. We could potentially add more threads to the program to do this, but I'm not sure it is worth the computational cost. More importantly, adding gapped alignments allows us to significantly transform the barcode as is demonstrated in the last example. I'm not sure I trust the outcome of gapped alignments to assign my sequence to the proper source. Further, if an indel occurs on the Illumina flow cell, what does this mean? One of the problems with Illumina is the propensity of sequences in a cluster of sequences on the flow cell to come out of phase with one another far into the sequencing process. If we see an indel in the first few bases of the sequence does this imply that the sequence is already out of phase? If so, it is likely not trustworthy.

I'm definitely happy to hear other people's opinions on this topic.

Best,

julian

高驥

March 2, 2015 at 9:37 PM

Julian Catchen

March 2, 2015 at 12:20 PM

Hi Kao Chi,

Thanks very much for your detailed analysis. I would like to get to the bottom of any discrepancies. Before I do, however, I just want to verify one thing.

When you specify the barcode distance to process_radtags, it is not the edit distance, the edit distance is one less than what is specified with --barcode_dist.

Specifying the barcode distance is a very old option for the program and it is meant to be interpreted as the distance between barcodes. Given a nucleotide distance of five between barcodes, we can correct anything with up to four errors without risking converting one barcode into another barcode.

So, to compare exactly against the other program you are using, you should specify --barcode_dist 6 to get an edit distance of 5.

Perhaps we should change this to be strictly the edit distance in the code base.

Anyway, if you could update your comparison this way, I will look into the discrepancies that remain.

Best,

julian

高驥

March 2, 2015 at 8:34 AM

Dear Julian and Stacks Users,

I have questions about barcode_dist and it rescure algorithm.

Our ddRAD Data have one site designed 8 12bp barcodes, and each other's edit distance is 5.

Barcode A AGTTGTGGACCT
Barcode B CCCACTTGAAAT
Barcode C CTCATGGTCTAT
Barcode D GCAAACACGTTT
Barcode E GCACGCTAAGCA
Barcode F GTTGTACCTCGG
Barcode G TATCACTCCGGG
Barcode H TGGGAACAGGGA

And the other site is illumina Index like Peterson et al. (2012), Index 05 Sequencing Quality QC isn't passed,so it discarded.

01 CGATGT
02 TTAGGC
03 TGACCA
04 ACAGTG
06 TAGCTT
07 GGCTAC
08 CTTGTA
09 CCGTCC
10 GTGAAA
11 GTTTCG
12 ATTCCT

I try a small data set to test demultiplex method. A data set had Raw data first 10000 Paired-End Reads, contained 11 Index. Each Index had 20000 Reads.

And used Stacks process_radtags v1.27 compared with Costea et al. (2013) - TagGD findIndex command which used edit ditstance to demultiplex.

高驥

unread,

Mar 5, 2015, 1:59:49 AM3/5/15

to stacks...@googlegroups.com, electr...@gmail.com, jcat...@illinois.edu

Hi Julian,

Thank you for your test and to find the difference in two program.

I have another question for Index demultiplex in Index01 and index04

It still make me confused.

Do you recommand only put 1 Index file and run 11 times for each Index when I use process_radtags?

I don't know why Index04 reads (22896) are more than input reads (20000).

That's probably get wrong sources.

Thank you very much.

Sincerely,

Kao Chi

Julian Catchen於 2015年3月5日星期四 UTC+8上午1時43分33秒寫道：

Julian Catchen

unread,

Mar 5, 2015, 11:50:04 AM3/5/15

to 高驥, stacks...@googlegroups.com

Hi Kao Chi,

This is happening because the number of allowed mismatches you are
specifying is 5. That's fine for the single-end barcode (which is 12bp
long), but it is a problem for the index barcode (which is only 6bp).
All of the "corrected" index barcodes are being placed into the first
correctable barcode that is found, which is ACAGTG.

In the new version, I have added two barcode mismatch parameters, one
for each barcode specified. When you instead use a mismatch of 5bp for
the inline barcode and 2bp for the index barcode, you get the expected
number, < 20,000.

I'll send you a link to the new code to try out. This change will be in
the next release.

Best,

julian

>> 高驥 <javascript:>

>> <https://lh3.googleusercontent.com/-GkgWhBr200w/VPUp8dS0LyI/AAAAAAAAAEA/r8IXgmN40DM/s1600/2015-03-03_112805.png>

>> Julian Catchen <javascript:>

>> March 2, 2015 at 12:20 PM
>> Hi Kao Chi,
>>
>> Thanks very much for your detailed analysis. I would like to get
>> to the bottom of any discrepancies. Before I do, however, I just
>> want to verify one thing.
>>
>> When you specify the barcode distance to process_radtags, it is
>> not the edit distance, the edit distance is one less than what is
>> specified with --barcode_dist.
>>
>> Specifying the barcode distance is a very old option for the
>> program and it is meant to be interpreted as the distance between
>> barcodes. Given a nucleotide distance of five between barcodes, we
>> can correct anything with up to four errors without risking
>> converting one barcode into another barcode.
>>
>> So, to compare exactly against the other program you are using,
>> you should specify --barcode_dist 6 to get an edit distance of 5.
>>
>> Perhaps we should change this to be strictly the edit distance in
>> the code base.
>>
>> Anyway, if you could update your comparison this way, I will look
>> into the discrepancies that remain.
>>
>> Best,
>>
>> julian
>>
>>
>>
>>

>> 高驥 <javascript:>

>> March 2, 2015 at 8:34 AM
>> Dear Julian and Stacks Users,
>>
>> I have questions about barcode_dist and it rescure algorithm.
>>
>> Our ddRAD Data have one site designed 8 12bp barcodes, and each
>> other's edit distance is 5.
>>

>> Barcode AAGTTGTGGACCT
>> Barcode BCCCACTTGAAAT
>> Barcode CCTCATGGTCTAT
>> Barcode DGCAAACACGTTT
>> Barcode E GCACGCTAAGCA
>> Barcode FGTTGTACCTCGG
>> Barcode G TATCACTCCGGG
>> Barcode HTGGGAACAGGGA
>>
>>

>> And the other site is illumina Index like Peterson/et al./ (2012),

>> Index 05 Sequencing Quality QC isn't passed,so it discarded.
>>
>> 01 CGATGT
>> 02 TTAGGC
>> 03 TGACCA
>> 04 ACAGTG
>> 06 TAGCTT
>> 07 GGCTAC
>> 08 CTTGTA
>> 09 CCGTCC
>> 10 GTGAAA
>> 11 GTTTCG
>> 12 ATTCCT
>>
>> I try a small data set to test demultiplex method. A data set had
>> Raw data first 10000 Paired-End Reads, contained 11 Index. Each
>> Index had 20000 Reads.
>>

>> And used Stacks process_radtags v1.27 compared with Costea/et al./

>> <https://lh3.googleusercontent.com/-HoFkt5bG-bA/VPRuB2O3b2I/AAAAAAAAADo/vxCU5RxZYj8/s1600/2015-03-02_220610.png>

Reply all

Reply to author

Forward