Same demultiplex sequence count with different max_barcode_errors parameter


Sanjeev Sariya

Aug 26, 2016, 2:36:38 PM
to Qiime 1 Forum
Hi QIIME developers,

I have been having a difficult time with barcodes in my data set (Illumina 300 PE, V3-V4 region). According to the QIIME logs, a lot of reads are thrown away because their barcodes are not found. To identify the root cause, I ran demultiplexing with different --max_barcode_errors values, but I seem to get the same output irrespective of the value I provide.
 

QIIME version: 1.9.1


R1=seqs.fq

R2=R2R3.paired.fq


split_libraries_fastq.py -i $R1 -o ./mismatach_0 -b $R2 -m ./edited_mapping_file.txt --max_barcode_errors 0 --sequence_max_n 0 --phred_quality_threshold 20 --max_bad_run_length 300 --min_per_read_length_fraction 0.75 --barcode_type 16


Sequences generated: 44233728


split_libraries_fastq.py -i $R1 -o ./mismatach_1 -b $R2 -m ./edited_mapping_file.txt --max_barcode_errors 1 --sequence_max_n 0 --phred_quality_threshold 20 --max_bad_run_length 300 --min_per_read_length_fraction 0.75 --barcode_type 16


Sequences generated: 44233728


split_libraries_fastq.py -i $R1 -o ./mismatach_2 -b $R2 -m ./edited_mapping_file.txt --max_barcode_errors 2 --sequence_max_n 0 --phred_quality_threshold 20 --max_bad_run_length 300 --min_per_read_length_fraction 0.75 --barcode_type 16


Sequences generated: 44233728


I get exactly the same log for all three runs. It is as follows:


Total number of input sequences: 91697563

Barcode not in mapping file: 44642128

Read too short after quality truncation: 2813038

Count of N characters exceeds limit: 8669

Illumina quality digit = 0: 0

Barcode errors exceed max: 0


I don't know what is going wrong. I was expecting the sequence counts to differ between runs, with corresponding changes in the logs.


Kindly help me through this.

Jose Antonio Navas Molina

Aug 26, 2016, 10:37:50 PM
to Qiime 1 Forum
Hi Sanjeev,

How many reads do you have for each sample? If you have enough reads per sample, this will be OK. Note that your run contains other things, such as PhiX, that are not barcoded, and those are counted in the "Barcode not in mapping file" field.

As for the counts of sequences generated, they seem okay.
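
A quick sanity check on those counts, as a sketch that uses only the numbers reported in the log quoted above (the variable names are illustrative): it adds up the reads written and the reads in each discard category and compares the sum with the input total.

```shell
# Counts copied from the split_libraries_fastq.py log quoted in this thread.
total_input=91697563
sequences_written=44233728     # reported as "Sequences generated"
barcode_not_in_map=44642128    # "Barcode not in mapping file"
too_short=2813038              # "Read too short after quality truncation"
too_many_n=8669                # "Count of N characters exceeds limit"

# Sum every category and see how many input reads are left unexplained.
accounted=$((sequences_written + barcode_not_in_map + too_short + too_many_n))
unaccounted=$((total_input - accounted))
echo "accounted: $accounted  unaccounted: $unaccounted"
```

The categories add up to the input total with nothing left over, which suggests that every input read was either written out or assigned to exactly one of the listed reasons.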

Cheers,

Sanjeev Sariya

Aug 27, 2016, 10:16:15 AM
to Qiime 1 Forum
Dear Jose,

Thank you for your reply. Below are a few numbers:

1305 samples
Max reads in sample: 149451
Min reads in sample: 29
Median reads: 31593
Average reads: 33895.577
Total reads: 44233728

I did this exercise to see if I could rescue reads by changing the value of the max_barcode_errors flag. I also ran it with a value of 4 and got the same number of reads.

Does this flag have no effect on the outcome?
The data are from a sequencing facility away from my location; if tweaking this flag has no effect, I wonder what can be done.

Thank you again for your reply and time.
--Sanjeev


Jose Antonio Navas Molina

Aug 29, 2016, 10:13:04 AM
to Qiime 1 Forum
Hi Sanjeev,

The "max_barcode_errors" parameter applies to error-correcting barcodes, such as Hamming or Golay barcodes. You can read more about them here, here and here.
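
To make "barcode errors" concrete, here is a minimal sketch that counts the positions at which an observed barcode differs from an expected one. It is only an illustration: the helper name barcode_mismatches and both 16-nt barcodes are made up, and this is not QIIME's decoder (real Golay decoding is more involved than comparing two strings).

```shell
# Count positions where two equal-length barcodes differ (Hamming distance).
# Illustration only; hypothetical helper, not part of QIIME.
barcode_mismatches() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (length(a) != length(b)) exit 1      # only defined for equal lengths
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    print d + 0                             # "+ 0" prints 0 when nothing differs
  }'
}

expected="ACGTACGTACGTACGT"   # made-up 16-nt barcode, as if from a mapping file
observed="ACGTACGAACGTACGT"   # the same barcode with one substituted base
mismatches=$(barcode_mismatches "$observed" "$expected")
echo "mismatches: $mismatches"
```

With a plain fixed-length barcode type such as --barcode_type 16, a read whose barcode is off by even one base would simply fail to match the mapping file, which may be why changing --max_barcode_errors made no difference in the runs above.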

From the outputs I see that a lot of sequences were recovered. The high number of "Barcode not in mapping file" reads does not surprise me, as it is normal to have other material in your run file that does not necessarily belong to your samples.

Hope this helps!

Sanjeev Sariya

Aug 29, 2016, 10:25:44 AM
to Qiime 1 Forum
Hi Jose,

Good morning!
Thank you for sharing the insightful papers.

To get to the root of these barcode problems, I used a mapping file covering all the samples in the run; I requested it from the sequencing center. :)

If I look at the numbers, it seems I'm losing almost 49% of my reads because the barcode is not in the mapping file.
100 * 44642128 / 91697563 = ~48.7% of reads thrown away
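
The same arithmetic can be extended to the other log categories. The sketch below only reuses the counts quoted earlier in this thread; the pct helper is a hypothetical name introduced for illustration.

```shell
# Share of the input reads that falls into each log category.
total_input=91697563

# Hypothetical helper: print count/total as a percentage with one decimal.
pct() {
  awk -v n="$1" -v t="$total_input" 'BEGIN { printf "%.1f", 100 * n / t }'
}

pct_not_in_map=$(pct 44642128)   # "Barcode not in mapping file"
pct_written=$(pct 44233728)      # "Sequences generated"
pct_too_short=$(pct 2813038)     # "Read too short after quality truncation"

echo "barcode not in mapping file: ${pct_not_in_map}%"
echo "sequences written:           ${pct_written}%"
echo "too short after truncation:  ${pct_too_short}%"
```

Seen this way, the run splits almost evenly between reads that were assigned to a sample and reads whose barcode was not found, with quality filtering removing only a few percent.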

I didn't know max_barcode_errors applied to Golay and Hamming barcodes.

Is there a way I can rescue this huge chunk of reads from being thrown away? I would appreciate any pointers from you.

Thank you for your time and reply.
--
Sanjeev

Jose Antonio Navas Molina

Aug 29, 2016, 10:40:05 AM
to Qiime 1 Forum
Hi Sanjeev,

I don't see an easy way of recovering those reads. It is possible that those reads do not even belong to your study. Are you sure that the sequencing center did not aggregate samples from other studies to your run? 
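
One way to look into where those reads come from is to tally the barcode reads themselves and compare the most frequent ones with the barcodes in the mapping file. The sketch below assumes a standard four-line-per-record FASTQ barcode file; the helper name tally_barcodes and the toy reads are made up for illustration.

```shell
# Tally barcode sequences in a FASTQ file, most frequent first.
# Assumes 4-line FASTQ records, so the sequence is line 2 of each record.
tally_barcodes() {
  awk 'NR % 4 == 2' "$1" | sort | uniq -c | sort -k1,1nr -k2,2 |
    awk '{ print $2 "\t" $1 }'      # output: barcode<TAB>count
}

# Toy barcode file standing in for the real one (made-up reads).
demo=$(mktemp)
printf '@r1\nAAAA\n+\nIIII\n@r2\nCCCC\n+\nIIII\n@r3\nAAAA\n+\nIIII\n' > "$demo"
tally=$(tally_barcodes "$demo")
echo "$tally"
rm -f "$demo"
```

On the real data this could be run as `tally_barcodes R2R3.paired.fq | head -n 20`. A handful of very abundant barcodes that are absent from the mapping file would point to other samples or studies on the run, whereas a long tail of rare barcodes would be more consistent with sequencing errors.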

Cheers,

Sanjeev Sariya

Aug 29, 2016, 11:18:53 AM
to Qiime 1 Forum
Dear Jose,

Yes; as per my conversation with the sequencing center, I have the mapping file for the entire run.
About 1/3 of the samples in the run were for our lab, and the remainder were for other studies/labs.

Thank you.
--Sanjeev


Jose Antonio Navas Molina

Aug 29, 2016, 11:30:50 AM
to Qiime 1 Forum
Hi Sanjeev,

If only 1/3 of the samples are for your lab, the other 42% of the sequences may belong to those other studies/labs.

Does that make sense?

Daniel Laubitz

Aug 29, 2016, 11:47:20 AM
to Qiime 1 Forum
Hi Sanjeev,
It looks like you have 1305 samples in your study and that number is 1/3 of the entire run, right? So that means they pooled almost 4000 samples in one run? That seems like a lot! I have never seen so many samples in one run. Just curious.
Best,
Daniel

Sanjeev Sariya

Aug 29, 2016, 12:07:33 PM
to Qiime 1 Forum
Dear Daniel,

Thank you for following up on this query.

Of these 1300 samples, 400 belong to my lab.

Best,
Sanjeev

Daniel Laubitz

Aug 30, 2016, 10:29:42 AM
to Qiime 1 Forum
Thanks, Sanjeev. So it looks like I misunderstood: the total number of samples was 1300. That's still a lot, I think. Usually I do not sequence more than 500 samples in one run. Anyway, that's not the subject of this thread. Thanks!
Daniel
 

Sanjeev Sariya

Aug 30, 2016, 4:49:31 PM
to Qiime 1 Forum
Hi Jose,

I see that reads which passed the quality thresholds but were not demultiplexed are not written to any file. Do you plan to add this in a future release, so that users can take a look at the discarded data?

Thank you,
Sanjeev

Colin Brislawn

Aug 30, 2016, 6:17:05 PM
to Qiime 1 Forum
Hello Sanjeev,

Take a look at the --retain_unassigned_reads flag of split_libraries_fastq.py. I think you could combine that flag with a grep command to capture the sequences you are looking for. 

--retain_unassigned_reads
Retain sequences which don't map to a barcode in the mapping file (sample ID will be "Unassigned") [default: False]
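
As a sketch of the "flag plus grep" idea, the snippet below pulls the unassigned records out of a FASTA file, using awk rather than grep so that wrapped sequences are handled too. It assumes that headers of unassigned reads start with ">Unassigned_" (inferred from the flag description and QIIME's usual SampleID_N header style, so worth checking against your own seqs.fna); the helper name extract_unassigned and the toy records are made up.

```shell
# Print only FASTA records whose header starts with ">Unassigned_".
# The keep flag is updated on header lines only, so multi-line sequences
# stay attached to their header.
extract_unassigned() {
  awk '/^>/ { keep = ($0 ~ /^>Unassigned_/) } keep' "$1"
}

# Toy FASTA standing in for the demultiplexed output (made-up records).
demo=$(mktemp)
printf '>S1_0 read1\nACGT\n>Unassigned_1 read2\nGGCC\n>S2_2 read3\nTTAA\n' > "$demo"
unassigned=$(extract_unassigned "$demo")
echo "$unassigned"
rm -f "$demo"
```

After a run with --retain_unassigned_reads, something like `extract_unassigned mismatach_0/seqs.fna > unassigned.fna` would collect those reads into their own file for inspection.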

Colin  

Sanjeev Sariya

Aug 31, 2016, 2:26:09 PM
to Qiime 1 Forum
Dear Colin,

Thank you very much for looking into my query.
That (retaining) was really helpful; exactly what I needed. :)

Cheers!
Sanjeev
