Hi Kao Chi,
This is happening because the number of allowed mismatches you are
specifying is 5. That's fine for the single-end barcode (which is 12bp
long), but it is a problem for the index barcode (which is only 6bp).
All of the "corrected" index barcodes are being placed into the first
correctable barcode that is found, which is ACAGTG.
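
To illustrate the effect, here is a minimal Python sketch. It is not the process_radtags code: the function names are hypothetical, and scanning the barcodes in sorted order is only an assumption (one that happens to put ACAGTG first). It shows how allowing 5 mismatches on a 6bp index lets almost any sequence be "corrected" to the first barcode tried.

```python
from itertools import product

# The 11 index barcodes listed in the original question.
INDEX_BARCODES = [
    "CGATGT", "TTAGGC", "TGACCA", "ACAGTG", "TAGCTT", "GGCTAC",
    "CTTGTA", "CCGTCC", "GTGAAA", "GTTTCG", "ATTCCT",
]

def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def first_correctable(observed, barcodes, max_mismatches):
    """Return the first barcode within max_mismatches of the observed sequence.

    ASSUMPTION: barcodes are scanned in sorted order and the first hit wins.
    This is an illustration only, not the actual Stacks implementation.
    """
    for barcode in sorted(barcodes):
        if mismatches(observed, barcode) <= max_mismatches:
            return barcode
    return None

observed = "TTAGGA"  # index 02 (TTAGGC) with a single sequencing error

print(first_correctable(observed, INDEX_BARCODES, 5))  # ACAGTG (wrong sample)
print(first_correctable(observed, INDEX_BARCODES, 1))  # TTAGGC (intended sample)

# How many of all possible 6bp sequences are within 5 mismatches of ACAGTG?
all_6mers = ["".join(p) for p in product("ACGT", repeat=6)]
within_5 = sum(mismatches(s, "ACAGTG") <= 5 for s in all_6mers)
print(within_5, len(all_6mers))  # 3367 4096
```

Under these assumptions, roughly 82% of all possible 6bp sequences are within 5 mismatches of ACAGTG, so nearly every uncorrected index read ends up there.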
In the new version, I have added two barcode mismatch parameters, one
for each barcode specified. When you instead use a mismatch of 5bp for
the inline barcode and 2bp for the index barcode, you get the expected
number, < 20,000.
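
As a rough sketch of the idea of one mismatch limit per barcode (again not the Stacks code: the function names are hypothetical, and picking the unique closest barcode is an assumption made for this illustration), using the barcodes from the original question and the limits 5 and 2 mentioned above:

```python
# Inline (12bp) and index (6bp) barcodes from the original question.
INLINE_BARCODES = [
    "AGTTGTGGACCT", "CCCACTTGAAAT", "CTCATGGTCTAT", "GCAAACACGTTT",
    "GCACGCTAAGCA", "GTTGTACCTCGG", "TATCACTCCGGG", "TGGGAACAGGGA",
]
INDEX_BARCODES = [
    "CGATGT", "TTAGGC", "TGACCA", "ACAGTG", "TAGCTT", "GGCTAC",
    "CTTGTA", "CCGTCC", "GTGAAA", "GTTTCG", "ATTCCT",
]

def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def closest_barcode(observed, barcodes, max_mismatches):
    """Return the unique closest barcode within max_mismatches, else None.

    ASSUMPTION: ties and out-of-range matches are rejected; the real
    program may resolve these cases differently.
    """
    scored = sorted((mismatches(observed, bc), bc) for bc in barcodes)
    best_dist, best_bc = scored[0]
    if best_dist > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # ambiguous: two barcodes are equally close
    return best_bc

def assign(inline_seq, index_seq, inline_max=5, index_max=2):
    """Correct each barcode with its own mismatch limit (hypothetical helper)."""
    return (
        closest_barcode(inline_seq, INLINE_BARCODES, inline_max),
        closest_barcode(index_seq, INDEX_BARCODES, index_max),
    )

# One error in each barcode: both are recovered.
print(assign("AGTTGTGGACCA", "TTAGGA"))  # ('AGTTGTGGACCT', 'TTAGGC')

# An index read that is too far from every index barcode is left unassigned.
print(assign("AGTTGTGGACCA", "AAAAAA"))  # ('AGTTGTGGACCT', None)
```

The point of the separate limits is that a threshold suited to the 12bp barcode is far too permissive for the 6bp one.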
I'll send you a link to the new code to try out. This change will be in
the next release.
Best,
julian
>> 高驥
>> <https://lh3.googleusercontent.com/-GkgWhBr200w/VPUp8dS0LyI/AAAAAAAAAEA/r8IXgmN40DM/s1600/2015-03-03_112805.png>
>> Julian Catchen
>> March 2, 2015 at 12:20 PM
>> Hi Kao Chi,
>>
>> Thanks very much for your detailed analysis. I would like to get
>> to the bottom of any discrepancies. Before I do, however, I just
>> want to verify one thing.
>>
>> The barcode distance you specify to process_radtags is not the
>> edit distance; the edit distance is one less than the value given
>> with --barcode_dist.
>>
>> Specifying the barcode distance is a very old option for the
>> program and it is meant to be interpreted as the distance between
>> barcodes. Given a nucleotide distance of five between barcodes, we
>> can correct anything with up to four errors without risking
>> converting one barcode into another barcode.
>>
>> So, to compare exactly against the other program you are using,
>> you should specify --barcode_dist 6 to get an edit distance of 5.
>>
>> Perhaps we should change this to be strictly the edit distance in
>> the code base.
>>
>> Anyway, if you could update your comparison this way, I will look
>> into the discrepancies that remain.
>>
>> Best,
>>
>> julian
>>
>>
>>
>>
>> 高驥
>> March 2, 2015 at 8:34 AM
>> Dear Julian and Stacks Users,
>>
>> I have questions about barcode_dist and its rescue algorithm.
>>
>> Our ddRAD data use eight 12bp barcodes at one site, with an edit
>> distance of 5 between them.
>>
>> Barcode A AGTTGTGGACCT
>> Barcode B CCCACTTGAAAT
>> Barcode C CTCATGGTCTAT
>> Barcode D GCAAACACGTTT
>> Barcode E GCACGCTAAGCA
>> Barcode F GTTGTACCTCGG
>> Barcode G TATCACTCCGGG
>> Barcode H TGGGAACAGGGA
>>
>>
>> The other site uses Illumina indexes, as in Peterson et al. (2012).
>> Index 05 did not pass sequencing quality QC, so it was discarded.
>>
>> 01 CGATGT
>> 02 TTAGGC
>> 03 TGACCA
>> 04 ACAGTG
>> 06 TAGCTT
>> 07 GGCTAC
>> 08 CTTGTA
>> 09 CCGTCC
>> 10 GTGAAA
>> 11 GTTTCG
>> 12 ATTCCT
>>
>> I tested the demultiplexing method on a small data set. It
>> consisted of the first 10000 paired-end reads of the raw data and
>> contained 11 indexes. Each index had 20000 reads.
>>
>> I used Stacks process_radtags v1.27 and compared it with Costea et al.
>> <https://lh3.googleusercontent.com/-HoFkt5bG-bA/VPRuB2O3b2I/AAAAAAAAADo/vxCU5RxZYj8/s1600/2015-03-02_220610.png>