Extract_barcodes before importing data into QIIME2 Issue


Alexis Walker

Nov 16, 2017, 3:37:46 PM
to qiime...@googlegroups.com
Hello, 

I have paired-end fastq.gz files, one forward (R1) and one reverse (R2). These files have not been demultiplexed. I would like to import them into QIIME2 for demultiplexing, trimming, and QC, but I do not have a barcodes file. I have read other answers pointing to the extract_barcodes.py script. I have run this script, but I am very concerned about the barcode length option, --bc1_len. Because I find the jargon here a little confusing, I'll clarify what I think this script is doing and where my issues lie.

I am assuming that "barcodes" in this script is used in two different ways: to mean the index primers and to mean the Illumina sequence identifier. Mine looks like this:
 @M03580:59:000000000-BHFJK:1:1101:12747:1451 1:N:0:CTCTGGTT+GTTTCCTT

So, I am thinking that the extract_barcodes.py script makes a file with the sequence identifiers and their associated index primers (barcodes.fastq), plus two files for R1 and R2 without the index primers. Is that right?
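To make the two meanings concrete, here is a minimal sketch that splits the identifier above into its parts. It assumes the standard CASAVA 1.8-style header layout, where the text after the space is `<read>:<is filtered>:<control number>:<index>`. The function name `parse_illumina_header` is made up for illustration and is not part of QIIME.

```python
def parse_illumina_header(header):
    """Split a CASAVA 1.8-style FASTQ header into read ID and index field.

    Hypothetical helper for illustration; not a QIIME function.
    """
    # Drop surrounding whitespace and the leading "@", then split at the space.
    read_id, _, comment = header.strip().lstrip("@").partition(" ")
    # Assumed comment layout: <read>:<is filtered>:<control number>:<index>
    read_number, is_filtered, control_number, index = comment.split(":", 3)
    return {
        "read_id": read_id,
        "read_number": read_number,
        "is_filtered": is_filtered,
        "control_number": control_number,
        "index": index,
        # Dual-index runs join the two index reads with "+"
        "index_parts": index.split("+"),
    }


header = "@M03580:59:000000000-BHFJK:1:1101:12747:1451 1:N:0:CTCTGGTT+GTTTCCTT"
fields = parse_illumina_header(header)
print(fields["read_id"])      # M03580:59:000000000-BHFJK:1:1101:12747:1451
print(fields["index_parts"])  # ['CTCTGGTT', 'GTTTCCTT']
```

The index field in the header is separate from any in-line barcode at the start of the read sequence itself, so the two do not have to agree.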

If that is correct, I am concerned about the barcode length (index primers), as my forward and reverse primers vary in length from 5 to 8 bases. When I input 8 as the barcode length (--bc1_len), it seems that I'll get more than my actual index extracted from the file. Additionally, when I try to match the actual indexes in my mapping file to the resulting barcodes.fastq output, they don't seem to match.
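The over-extraction concern can be illustrated with a short sketch. The barcodes and the read below are invented for the example, and `match_variable_barcode` is a hypothetical helper, not something extract_barcodes.py provides. It compares a fixed 8-base slice with a match against a list of known barcodes of different lengths.

```python
def match_variable_barcode(read, known_barcodes):
    """Return the longest known barcode the read starts with, or None.

    Hypothetical helper for illustration; not a QIIME function.
    """
    # Try longer barcodes first so a short barcode that is a prefix
    # of a longer one does not win by accident.
    for barcode in sorted(known_barcodes, key=len, reverse=True):
        if read.startswith(barcode):
            return barcode
    return None


# Invented example data: barcodes of length 5, 7, and 8.
known = ["ACGTA", "TTGCAGC", "GGATCCTA"]
read = "ACGTAGTGCCAGCAGCCGCGG"  # 5-base barcode followed by other sequence

fixed_slice = read[:8]
print(fixed_slice)                          # ACGTAGTG (3 bases too many)
print(match_variable_barcode(read, known))  # ACGTA
```

A fixed length of 8 only returns the true barcode for the 8-base indexes. For shorter ones it pulls in bases from whatever follows, which would also explain extracted barcodes that fail to match the mapping file.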

Any help is much appreciated!

Thank you, 
Alexis

kyli...@gmail.com

Nov 16, 2017, 4:27:22 PM
to Qiime 1 Forum
My data is just like yours. I think "CTCTGGTT + GTTTCCTT" is two barcodes, one each for the forward and reverse sequences. Usually I join the forward and reverse sequences first, remove the "+" between the barcodes, and then extract barcodes with a length of 16 and the barcode type set to "in the label". In the mapping file I concatenate the two barcodes into one (because "+" is not allowed in the mapping file, at least in QIIME 1), and then demultiplex.
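As a rough sketch of the concatenation step, assuming the index field has the form `FORWARD+REVERSE` with no spaces (the function name `combined_barcode` is made up for illustration):

```python
def combined_barcode(index_field):
    """Join the two halves of a dual-index field into one barcode string.

    Hypothetical helper for illustration; not a QIIME function.
    """
    forward, reverse = index_field.split("+")
    # The mapping file barcode has no "+", so join the halves directly.
    return forward + reverse


barcode = combined_barcode("CTCTGGTT+GTTTCCTT")
print(barcode, len(barcode))  # CTCTGGTTGTTTCCTT 16
```

The combined string is what would go in the barcode column of the mapping file under this approach.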

Alexis Walker

Nov 16, 2017, 4:43:48 PM
to Qiime 1 Forum
The "CTCTGGTT + GTTTCCTT" doesn't actually match any of the primer indexes I used. Additionally, I was unable to merge the reads prior to extracting barcodes using join_paired_ends.py because I kept getting a mapping file error. Even after validating the mapping file with validate_mapping_file.py and using the corrected.txt output, I got the same error. Any thoughts on why the barcodes wouldn't match those I used, and on the mapping file issue?

Alexis Walker

Nov 16, 2017, 4:52:26 PM
to Qiime 1 Forum
I am realizing now that the "CTCTGGTT + GTTTCCTT" barcodes are a part of the sequence identifier used for the initial demultiplexing of libraries (used to discern different libraries). Since I am using the R1 and R2 from one library, all of the identifiers have "CTCTGGTT + GTTTCCTT".

If I merge the sequences before extracting the barcodes, the barcode length still won't be exactly 16, because my actual primer indexes have different lengths. So the merged barcodes (primer indexes) will be 10-16 bases long.

Colin Brislawn

Nov 16, 2017, 5:05:43 PM
to Qiime 1 Forum
Hello folks,

This is the first time I have seen Illumina data using this barcode+barcode format inside of the fastq file. I've cc'ed Tony, one of the QIIME devs, who has worked with dual-indexed reads before and who should be more help. If this is a new format Illumina is using, we absolutely need to support it in QIIME2, so I appreciate that you brought this to our attention.

Colin

TonyWalters

Nov 16, 2017, 5:34:35 PM
to Qiime 1 Forum
Hello Alexis,

We don't have a built-in solution for this yet, but here is a custom script that could help out: https://gist.github.com/walterst/98ded207e50802ced85b736a2f78319c

As long as the barcodes are of consistent length, you should be able to use the extract_barcodes.py option for barcodes in the label (--input_type barcode_in_label) to pull the barcodes out of the header of the resulting filtered file.

For the barcodes not matching your expected barcodes, you'll probably want to check the full results first and make sure it's not just the first few barcodes in the fastq file that aren't matching. Hopefully the above custom script can help with that.
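One way to check the whole file rather than the first few records is to tally the index field of every header. This is a minimal sketch, separate from the gist linked above. It assumes plain 4-line FASTQ records and headers that end in `:<index>`, and `count_header_indexes` is a made-up name.

```python
import gzip
from collections import Counter


def count_header_indexes(fastq_path):
    """Count how often each header index string occurs in a FASTQ file.

    Hypothetical helper for illustration. Assumes 4-line records and
    headers ending in ":<index>" (for example "1:N:0:CTCTGGTT+GTTTCCTT").
    """
    counts = Counter()
    # Read gzip-compressed files transparently based on the file extension.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each record
                counts[line.rstrip("\n").rsplit(":", 1)[-1]] += 1
    return counts


# Example usage (the file name is a placeholder):
# for index, n in count_header_indexes("R1.fastq.gz").most_common(10):
#     print(index, n)
```

If a single index string accounts for nearly all reads, the header barcodes identify the library rather than the individual samples.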

Alexis Walker

Nov 16, 2017, 5:48:34 PM
to Qiime 1 Forum
Thank you Tony for your reply. My index barcodes are of varying lengths, so is there no way to extract barcodes so that I can use my data set in QIIME? Also, I'm not sure if this came across clearly enough, but the "CTCTGGTT + GTTTCCTT" are not my between-sample indexes but my between-library indexes. I am only seeking to analyze L1ND_1308. See image below.


TonyWalters

Nov 17, 2017, 2:36:43 AM
to Qiime 1 Forum
Hello Alexis,

Apologies for not having a solution for this. I am indeed not fully understanding the constructs/sequencing design here. Is this based on a particular article that we can reference to better handle this in the future? (It appears that you have the TruSeq demultiplexing on top of another layer of barcodes in your data.) Did the sequencing center do any of their own custom approaches to processing the data?

-Tony