Mapping file format for dual indexed, paired end reads and pre-processing questions

2,177 views
Skip to first unread message

Sara Dunaj

unread,
Dec 6, 2016, 9:05:48 AM12/6/16
to qiime...@googlegroups.com

Hi All,

 

I am struggling to pre-process my 16S rRNA Illumina Sequening data with QIIME. I have several issues that I have not been able to find specific answers for:

 

I have 4 files from the sequencing - read 1, read 2, index 1 and index 2 (MiSeq Paired End - 2x 250 cycle).

 

Workflow Plan:

1) Extract Barcodes (extract_barcodes.py):  with the option to re-orientate reads (I am finding the reverse complement of my i7 adaptor / linker/ pad/ and barcodes at the beginning of some of my reads in the read 1 file and vice versa in the read 2 file (but reverse complement i5 adapter and forward primer instead)

 

2) Join Paired Ends (join_paired_ends.py): with option to update the index / barcode reads file to match the surviving joined pairs.

 

3) Split libraries (split_libraries.py) :(Demulitplex and QC - with option z- to remove the reverse primer (and adapter / linker / pad/ sequence)

 

My Questions:

1) I am struggling how to see how the extract barcodes script helps me - in QIIME's website it says:

for two index/barcode reads and two fastq reads...

This situation can be treated as a special case of paired-end reads. One could supply the index files (labeled as index1.fastq, index2.fastq) and use the --input_type barcode_paired_end:

 

i.e.: extract_barcodes.py --input_type barcode_paired_end -f index1.fastq -r index2.fastq --bc1_len 6 --bc2_len 6 -o parsed_barcodes/

 

The output barcodes.fastq file would be used for downstream processing, and the reads1 and reads2 files could be ignored.... (this is what I don't understand... I need the original read 1 and 2 to join the reads and then demultiplex samples in the other 2 scripts, correct?) Or is this referring to the output files from this script? Also, I would like to run the extract barcode script to re-orientate my reads. Is this possible when I need to run it with my index read files? Or do I need to re-run the script to do this - but then wont this generate an additional barcodes file? If so, how to I handle that additional barcode file?

 

2) The mapping file for the split_libraries.py script- THIS is my biggest issue. How do I list both barcodes when the formatting and script allows for only one barcode column? Should I concatenate them? Do I need to reverse complement the barcodes? How do others handle duel barcodes with this script and mapping file format? I have been reading other pages but I can't find an answer or example on how best to handle this.

 

3) Any other examples or resources to handle paired end illumina miseq data in QIIME for first time users - specifically for those with 4 original sequencing files - 2 read files and 2 index read files.

 

Thank you in advance!

 

Sara

Sara Dunaj

unread,
Dec 6, 2016, 10:56:28 PM12/6/16
to Qiime 1 Forum
UPDATE:

I ran the extract barcodes and join paired end scripts and used that output for the split libraries fastq script... I couldn't not get the split libraries script to work... Can someone possibly help me troubleshoot why this script didn't work? Below are the scripts /commands I ran:

MacQIIME Sara-Dunajs-MacBook-Pro:QIIME $ extract_barcodes.py --input_type barcode_paired_end -f Trunc_Index1.fastq -r Trunc_Index2.fastq --bc1_len 8 --bc2_len 8 -o parsed_barcodes/

MacQIIME Sara-Dunajs-MacBook-Pro:QIIME $ join_paired_ends.py -f /Users/Sara_Jeanne/Desktop/QIIME/Trunc_Read1.fastq -r /Users/Sara_Jeanne/Desktop/QIIME/Trunc_Read2.fastq -b /Users/Sara_Jeanne/Desktop/QIIME/parsed_barcodes/barcodes.fastq -o /Users/Sara_Jeanne/Desktop/QIIME/fastq-join_joined

First pass with Split_Libraries_Fastq.py:
MacQIIME Sara-Dunajs-MacBook-Pro:QIIME $ split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -m Pilot_QIIME_MapFile_2.txt -o slout_q20/ --store_qual_scores -q 19
Error in split_libraries_fastq.py: Some or all barcodes are not valid golay codes. Do they need to be reverse complemented? If these are not golay barcodes pass --barcode_type 12 to disable barcode error correction, or pass --barcode_type # if the barcodes are not 12 base pairs, where # is the size of the barcodes. Invalid codes:
    CGTGAGTGGGAGACTA ACGACGTGACCTACTG TCATCGAGCTACGCAG CGTGAGTGAGCGCTAT ACGACGTGCAGTGAGT GGATATCTCGTACTCA GGATATCTGGAGACTA ACGACGTGCTACGCAG CGTCGCTAAGCGCTAT CGTCGCTACAGTGAGT GGATATCTAGTCTAGA CGTGAGTGAGTCTAGA TCATCGAGGGAGACTA GGATATCTCTACGCAG TCATCGAGCGTACTCA TCATCGAGACCTACTG CGTCGCTACTACGCAG ATATACACACCTACTG TCATCGAGCAGTGAGT ACGACGTGGGAGACTA ACGACGTGCGTACTCA CGTGAGTGACCTACTG CGTGAGTGCTACGCAG GGATATCTCAGTGAGT CGTGAGTGCAGTGAGT CTGCGTGTAGCGCTAT TCATCGAGAGCGCTAT CTGCGTGTACCTACTG GGATATCTAGCGCTAT CTGCGTGTAGTCTAGA GGATATCTACCTACTG ACGACGTGAGTCTAGA CGTGAGTGCGTACTCA ACGACGTGAGCGCTAT TCATCGAGAGTCTAGA CGTCGCTAACCTACTG


Then I tried again adding option for Barcode Type :

MacQIIME Sara-Dunajs-MacBook-Pro:QIIME $ split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -m Pilot_QIIME_MapFile_2.txt -o slout_q20/ --store_qual_scores --barcode_type 8 -q 19
/macqiime/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)

I tried again with 16 for barcode_type option:

MacQIIME Sara-Dunajs-MacBook-Pro:QIIME $ split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -m Pilot_QIIME_MapFile_2.txt -o slout_q20/ --store_qual_scores --barcode_type 16 -q 19
/macqiime/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)

Then again with reverse complementing the barcodes in the mapping file:

MacQIIME Sara-Dunajs-MacBook-Pro:QIIME $ split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -m Pilot_QIIME_MapFile_2.txt -o slout_q20/ --store_qual_scores --rev_comp_mapping_barcodes -q 19
Error in split_libraries_fastq.py: Some or all barcodes are not valid golay codes. Do they need to be reverse complemented? If these are not golay barcodes pass --barcode_type 12 to disable barcode error correction, or pass --barcode_type # if the barcodes are not 12 base pairs, where # is the size of the barcodes. Invalid codes:
    CAGTAGGTGTGTATAT TGAGTACGCACGTCGT ACTCACTGAGATATCC ATAGCGCTTAGCGACG CTGCGTAGCTCGATGA TAGTCTCCCACGTCGT TGAGTACGCACTCACG TGAGTACGAGATATCC ACTCACTGCTCGATGA CAGTAGGTAGATATCC TAGTCTCCAGATATCC CAGTAGGTACACGCAG CTGCGTAGCACGTCGT ACTCACTGCACTCACG ACTCACTGCACGTCGT CAGTAGGTCTCGATGA ATAGCGCTACACGCAG ATAGCGCTCTCGATGA TCTAGACTCACGTCGT ATAGCGCTCACTCACG CAGTAGGTCACGTCGT ATAGCGCTAGATATCC ATAGCGCTCACGTCGT TGAGTACGCTCGATGA CAGTAGGTCACTCACG CTGCGTAGCACTCACG TCTAGACTCTCGATGA CTGCGTAGAGATATCC TAGTCTCCCTCGATGA TCTAGACTCACTCACG ACTCACTGTAGCGACG TAGTCTCCCACTCACG TCTAGACTAGATATCC CTGCGTAGTAGCGACG CAGTAGGTTAGCGACG TCTAGACTACACGCAG

I also tried the reverse complement mapping barcodes option along with the Barcode_Type option of 16... still no success.

Any help on how to proceed with the split libraries script would be greatly appreciated.

Thank you!

Sara

TonyWalters

unread,
Dec 7, 2016, 1:58:10 AM12/7/16
to Qiime 1 Forum
The mean of an empty slice error indicates that it's not matching any of your barcodes (or all reads are being filtered out for other quality issues). The barcocde_type should be a numeric type, e.g. 16, and not the default golay for your case.

However, you've got a more complicated situation than normal, as you have separate barcode reads AND reads (notionally both the barcodes and the read1/read2) that are out of orientation. I think this will take custom scripts to process, as the special case of paired-end reads approach won't be able to orient your other read1/read2 files.

Can you first confirm that you can find your barcodes in your raw index reads, i.e., if you use some of the 8 base pair barcodes you can use a grep command to find them, e.g.:
grep -c AATTCCGG barcodes.fastq

Sara Dunaj

unread,
Dec 7, 2016, 9:44:20 AM12/7/16
to Qiime 1 Forum
Hi Tony,

I appreciate your help. I am quite novice at this and I am uncertain on how to make up a custom script to do this. I did a QC of my raw reads prior to running the scripts and I am able to find my barcodes in the raw index reads. I am working with a truncated set to test out these scripts with my data.

Would it be better to use BaseSpace or ask the sequencing center to demultiplex my reads prior to moving forward with QIIME?

Thank you for your help,

Sara

TonyWalters

unread,
Dec 7, 2016, 11:24:15 AM12/7/16
to Qiime 1 Forum
Hello Sara,

It might be worth asking the sequencing center if they can demultiplex the reads-if they do this, that will end up giving you reads that are one read1 and one read2 fastq per sample. It does present other processing challenges, but those could be overcome.

But we may be able to process the data using scripts that already exist, albeit outside of QIIME.

Here's is how I'm picturing your reads from this email thread:
Index1: 8 base pair barcodes
index2: 8 base pair barcodes
(with about half of the reads in the reverse orientation)

read 1:
I7 + PCR primer site + rest of amplicon read
read 2:
I5 + PCR primer site + read of amplicon read

Ideally, we need to stick the 8 base pairs for index 1 and the 8 base pairs from index 2 together into a 16 base pair read after orienting the reads, so you get 16 base pair barcodes that match up to the combined barcodes you have in your mapping file (barcode 1 + barcode 2 under BarcodeSequence). Additionally, orienting the read1 and read2 could be useful, for cutting out the undesired sites (I7/I5 adapter sites and PCR primer sites).

I propose this approach-use this custom script to stick the index 1 barcodes to the front of the read 1 reads and the index 2 reads to the beginning of the read2 reads: https://gist.github.com/walterst/7326543

Then you should have combined barcode+read files that you can use for the paired-end barcodes option with extract_barcodes, and you can turn on the read orientation option get the reads in the correct direction. It should create the 16 base pair (orientated) barcode reads from this, plus oriented read1/2 files. You would want to manually check some of the reads to make sure they get oriented in this step. Then you would want to get rid of the adapter + primer sites-as long as you have consistent lengths of these regions (not using a phased scheme), you could call extract_barcodes.py a second time on the oriented read1/2 files from the first call, this time specifying the bc1_len and bc2_len parameters to be the length of the I7 adapter + forward primer length and I5 adapter + reverse primer length respectively. The resulting read1/2 files from this second call should have these PCR construct regions removed.

Then you could do the read stitching step with join_paired_ends.py and these final read1/2 fastq files (make sure to pass the barcodes file from the first call to extract_barcodes.py with -b for join_paired_ends.py so you get a filtered version of it as well), followed by split_libraries_fastq.py --barcode_type 16. Hopefully it will retain reads correctly this time, but you probably want to look at some of the 16 base pair barcodes that are created from the first call to extract_barcodes.py to make sure the resulting barcodes are matching those in your mapping file.

Hope this helps, and sorry it isn't a simple answer.

-Tony

chuanwu wang

unread,
Dec 7, 2016, 11:27:53 AM12/7/16
to Qiime 1 Forum
Hi Tony,
I have the same question as Sara mentioned.
I run grep -c for my R1 read, I found the number is 97. For my R2 read (reverse read), the number is 116. I could not know whether the bardcode sequence is in the front of each read or not?
Would you please let me know how to extract the barcode for both end read?
Thanks
Chuanwu

TonyWalters

unread,
Dec 7, 2016, 11:53:01 AM12/7/16
to Qiime 1 Forum
Hello Chuanwu,

When you grep for barcodes, I would expect that number to be much higher, at least tens of thousands for most barcodes in Illumina samples. You might try using the reverse complement of the barcode to see if it's out of orientation (http://www.bioinformatics.org/sms/rev_comp.html) and do the grep -c command for some of your reverse oriented barcodes. If you're still not seeing them, it might be a problem with the barcode sequences. Or it's possible that they already were removed from your reads, which happens if the sequencing center already demultiplexed your reads (in this case, you're likely to get a set of R1/R2 reads, one per sample).

chuanwu wang

unread,
Dec 7, 2016, 1:50:57 PM12/7/16
to Qiime 1 Forum
Hi Tony,
I think the files are demultiplexed already after calling Illumina. I used the grep -c to test one sample just now.
If that is the case, what kind of mapping file do I need to use?
Thanks
Chuanwu



TonyWalters

unread,
Dec 7, 2016, 1:59:57 PM12/7/16
to Qiime 1 Forum

chuanwu wang

unread,
Dec 7, 2016, 2:43:54 PM12/7/16
to Qiime 1 Forum
Thank you Tony.
One more question about the Qiime using in the BaseSpace:
Could you let me know how the command split_libraries_fastq.py. did exactly.
When I checked the same fastq file, I could not find the bardcodes there. However, in the baseSpace after being processed by split_libraries_fastq.py. in the folder of sl-out, there is one seqs.fna file, which has the right barcode there. 
Thanks again  

"Outputs

An example of the output that gets generated can be found here, the main two outputs are the following folders:

  • sl-out: includes a summary of the quality controlled sequences per sample as processed by split_libraries_fastq.py.

  • closed-ref: the main file you want to pay attention to in this folder is otu_table.biom which contains your OTU table picked using pick_closed_reference_otus.py. This is the output that you will need to use as an input for the QIIME Visualizations app."

Sara Dunaj

unread,
Dec 7, 2016, 4:26:55 PM12/7/16
to Qiime 1 Forum
Hi Tony,

Thank you very much for your help with this. I wanted to clarify if I should use your script on the de-multiplexed read files that I can get from the sequencing site or run your script and your recommendations on the original files.

Thank you!

Sara

TonyWalters

unread,
Dec 8, 2016, 12:40:11 AM12/8/16
to Qiime 1 Forum
Hello Sara,

I would run these on the original files, and skip trying to get the already-demultiplexed sequences for now.

Sara Dunaj

unread,
Dec 9, 2016, 12:09:05 AM12/9/16
to Qiime 1 Forum
Hi Tony,

Thank you for clarifying. I am running your script to attach the barcodes in front of my reads. I believe with the MiSeq run the index 1 read is the i7 barcodes and the index 2 read is the i5 barcode/ index reads. When I checked my original index files, my forward read/ i5 barcodes are in the index 2 file. (and vice versa with the i7 barcodes in the index 1 file.) I am tying the process with the index 2 barcodes attached to the read 1 reads and the index 1 barcodes attached to the read 2 reads. Below is the output:

[Sara-Dunajs-MacBook-Pro:~/Desktop/QIIME] Sara_Jeanne% python merge_bcs_reads.py Trunc_Index2.fastq Trunc_Read1.fastq Trunc_Read1_wBarcodes.fastq
Traceback (most recent call last):
  File "merge_bcs_reads.py", line 6, in <module>
    from cogent.parse.fastq import MinimalFastqParser
ImportError: No module named cogent.parse.fastq

Could you possibly help advise on this error and what I may need to change?

Thank you,

Sara

TonyWalters

unread,
Dec 9, 2016, 12:22:28 AM12/9/16
to Qiime 1 Forum
Sara, can you start the macqiime environment? The script uses some of the same dependencies that are in that environment.

Sara Dunaj

unread,
Dec 9, 2016, 4:28:32 PM12/9/16
to qiime...@googlegroups.com
Thank you - that was my next thought. I have run the test data with your script and then the extract barcode script with the option to re-orientate the reads. I have many more un-oriented reads then orientated. I attached a screen shot of the output folder. Is it possible to run both sets of data through the joined_paired_end.py and split_libraries_fastq.py scripts? Or should I try something else?

Also, I checked the new barcode files from the first extract barcodes run - the second/ i7 barcodes are not present in the new barcode file as expected... there is one that is shifted to be more in the middle of the extracted barcode read instead of the last 8 bp or in reverse complement as the first 8 bp (instead of the last 8 bp). I don't think I can use these newly formatted barcode files.

Thank you again for all your help with this. I really appreciate it.
Screen Shot 2016-12-09 at 4.15.14 PM.png

TonyWalters

unread,
Dec 10, 2016, 12:50:17 AM12/10/16
to Qiime 1 Forum
For your first question, the orientation is probably being limited by the primer sequences. The script looks for an exact match to your primers (it can handle degenerate primers, but doesn't handle mismatches or indels). Both of your primers need to be in 5'->3' orientation (if it finds the PrimerSequence as a match in a read, it assumes it's in the correct orientation, if it finds the reverse complement of the PrimerSequence, it reverses the reads, and the inverse happens for the ReversePrimerSequence). You may try shortening the primers to something like the last 3' 15 bases or so to get better matches.

For the barcode reads, I think you'll need to pull out some example sequences from the 1. raw I1/I2 reads (and do these match your expected data?) and 2. processed data. If you can label the barcodes/adapters that would be helpful.

Sara Dunaj

unread,
Dec 14, 2016, 5:07:10 PM12/14/16
to Qiime 1 Forum
Hi Tony,

My apologies for the delay. In my mapping file- all my sequences/ barcodes are in the 5'-3' orientation. I have attached a truncated set of my index 1 and 2 reads. My raw Index read 1 file has reads of the reverse complement of my i7/ reverse barcodes as well. Below are my adapters and barcodes (all in 5'-3'):

Forward / i5 Adapter Sequence:
AATGATACGGCGACCACCGAGATCTACAC

Forward Barcodes:
ACGACGTG
ATATACAC
CGTCGCTA
CTGCGTGT
TCATCGAG
CGTGAGTG
GGATATCT


Reverse/ i7 Adapter Sequence:
CAAGCAGAAGACGGCATACGAGAT

ACCTACTG
AGCGCTAT
AGTCTAGA
CAGTGAGT
CGTACTCA
CTACGCAG
GGAGACTA


Also, I contacted the sequencing site and I was able to obtain de-multiplexed files from them. I am not sure what is the best way to proceed.


Thank you very much for your time and help with this,

Sara
Trunc_Index1.fastq
Trunc_Index2.fastq

TonyWalters

unread,
Dec 15, 2016, 3:54:40 AM12/15/16
to Qiime 1 Forum
Hello Sara,

From what I'm seeing, some of the currently oriented forward barcodes hit index2, and some of the reverse complement reverse barcodes hit index1. The parameters for extract_barcodes.py like --rev_comp_bc1 and --rev_comp_bc2 and --switch_bc_order might allow you to get the barcodes into an order that matches the final 16 base pair sequences. I'm assuming you have barcodes like ACGACGTGACCTACTG (or some other specific forward + reverse barcode sequence for each sample under your BarcodeSequence values in your mapping file)?

It may also be simpler to process the already demultiplexed data. You might want to check some of the reads to see if the adapter/PCR primer and other constructs are still in the reads and in what orientation(s). There's a step we can do downstream to pull them out, but you'd want to demultiplex your separate per-sample fastq files first. This is the script for automating that process, at least somewhat, based upon using the fastq filenames as SampleID identifiers: http://qiime.org/scripts/multiple_split_libraries_fastq.html

Sara Dunaj

unread,
Dec 18, 2016, 11:03:05 AM12/18/16
to qiime...@googlegroups.com
Hi Tony,

Thank you very much for your time and help with this. I have tried running the extract barcodes with the parameters you suggested with the read files generated from your custom script with the barcodes attached:

extract_barcodes.py --input_type barcode_paired_end -f Trunc_Read1_wBarcodes.fastq -r Trunc_Read2_wBarcodes.fastq -m Pilot_QIIME_MapFile_2.txt --attempt_read_reorientation --rev_comp_bc2 --switch_bc_order --bc1_len 8 --bc2_len 8 -o parsed_barcodes/

The output folder has the same files when previously run - with much larger non-oriented reads and barcodes (see attached). The barcode files - both orientated and not oriented have my expected barcodes in the correct order and orientation.

I re-ran the extract barcodes again without the --attempt_read_orientation parameter and tried to down stream join_paired_end and split_libraries scripts. My mapping file does have the concatenated barcodes -forward + reverse in the BarcodeSequence column. Below are the command and part of the log file output from the Split Libraries script:

MacQIIME Sara-Dunajs-MBP:fastq-join_joined_Trial4 $ split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -m Pilot_QIIME_MapFile_2.txt -o Split_Lib_Test4_attachbc_NoAdaptTrim_q20/ --store_qual_scores --barcode_type 16 -q 19

Input file paths
Mapping filepath: Pilot_QIIME_MapFile_2.txt (md5: e2faa315110a097a5058ae5af069d874)
Sequence read filepath: fastqjoin.join.fastq (md5: 66599d66829ede70cec46c2be6676684)
Barcode read filepath: fastqjoin.join_barcodes.fastq (md5: 372062c3db92eee9bb6297d819ff9d32)

Quality filter results
Total number of input sequences: 1128
Barcode not in mapping file: 520
Read too short after quality truncation: 109
Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

Result summary (after quality filtering)
Median sequence length: 307.00

Total number seqs written    499

Does this look ok to you? There seems to be a lot of cases (~1/2) where the barcode isn't in the mapping file. Or does this output look small for starting with 10,000 reads / lines for each input file?


Also, read 1 still had the reverse complement i7 adapter, linker/ primer sequence at the beginning and vice versa with the read 2 file (i5 adapter) with both runs of the extract barcode script, so I re-ran the extract barcode script to remove/ trim these from the reads and then proceeded with this... I am concerned that because only a small fraction of reads seemed to be affected with this early and maybe I would loose a lot of data by cutting all the reads down... also the joined reads do not have the reverse complement adaptor/ index sequence, so this step may be irrelevant. Below are the results - log:
Input file paths
Mapping filepath: Pilot_QIIME_MapFile_2.txt (md5: e2faa315110a097a5058ae5af069d874)
Sequence read filepath: fastqjoin.join.fastq (md5: 6343a97112be90de8cde963a9a2fd4a0)
Barcode read filepath: fastqjoin.join_barcodes.fastq (md5: d7669b3fea2bc5a32cc0b24ef97747a1)

Quality filter results
Total number of input sequences: 503
Barcode not in mapping file: 443
Read too short after quality truncation: 2
Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

Result summary (after quality filtering)
Median sequence length: 195.00

only 58 sequences were written .. so this was much worse.

Do you have any other suggestions on which path to take forward? I am going to try working with the already de-multiplexed samples with the script you mentioned.

Screen Shot 2016-12-15 at 4.12.01 PM.png

TonyWalters

unread,
Dec 18, 2016, 11:24:05 AM12/18/16
to Qiime 1 Forum
Hello Sara,

I'd try the full data with the first approach. The initial and ending reads can have issues, so while they can be useful for troubleshooting, they aren't necessarily a good sample population for the entirety of the reads.

If it's still giving poor reads, I would suggest going with the per-sample fastq sequence pairs from the sequencing center.

Sara Dunaj

unread,
Dec 20, 2016, 10:51:38 AM12/20/16
to Qiime 1 Forum
Hi Tony,

I ran the following on my full data set:

MacQIIME Sara-Dunajs-MBP:QIIME $ python merge_bcs_reads.py Undetermined_S0_L001_I2_001.fastq Undetermined_S0_L001_R1_001.fastq Full_Read1_wBarcodes.fastq

MacQIIME Sara-Dunajs-MBP:QIIME $ python merge_bcs_reads.py Undetermined_S0_L001_I1_001.fastq Undetermined_S0_L001_R2_001.fastq Full_Read2_wBarcodes.fastq

MacQIIME Sara-Dunajs-MBP:QIIME $ extract_barcodes.py --input_type barcode_paired_end -f Full_Read1_wBarcodes.fastq -r Full_Read2_wBarcodes.fastq -m Pilot_QIIME_MapFile_2.txt --rev_comp_bc2 --switch_bc_order --bc1_len 8 --bc2_len 8 -o parsed_barcodes_fullreads/

MacQIIME Sara-Dunajs-MBP:QIIME $ cd parsed_barcodes_fullreads/
MacQIIME Sara-Dunajs-MBP:parsed_barcodes_fullreads $ join_paired_ends.py -f reads1.fastq -r reads2.fastq -b barcodes.fastq -o /Users/Sara_Jeanne/Desktop/QIIME/fastq-join_joined_FullReads

MacQIIME Sara-Dunajs-MBP:~ $ cd Desktop/QIIME/fastq-join_joined_FullReads/
MacQIIME Sara-Dunajs-MBP:fastq-join_joined_FullReads $ split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -m Pilot_QIIME_MapFile_2.txt -o Split_Lib_FullReads_attachedBC_q20/ --store_qual_scores --barcode_type 16 -q 19 

Below is the log:


Input file paths
Mapping filepath: Pilot_QIIME_MapFile_2.txt (md5: e2faa315110a097a5058ae5af069d874)
Sequence read filepath: fastqjoin.join.fastq (md5: 221bbbdb4678b3fbc4f3130cfa68ddce)
Barcode read filepath: fastqjoin.join_barcodes.fastq (md5: 9b9ee26c2e8d7485ae57c8029a079a86)

Quality filter results
Total number of input sequences: 2013780
Barcode not in mapping file: 981204
Read too short after quality truncation: 136147

Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

Result summary (after quality filtering)
Median sequence length: 307.00
Non.env.Cricket.2    54518
Env.Husk.3    53623
Env.Cricket.3    53498
Silk.15    50256
Ovary.15    48444
Non.env.cricket.1    48099
Env.Husk.1    46240
Non.env.cricket.3    44381
P.tep.12    40879
Env.Husk.2    38424
Venom.14    29445
Ceph.14    27015
Ovary.13    26816
Ceph.13    21856
Silk.14    21327
Fat.15    21213
Venom.15    20498
Env.Cricket.2    20479
Silk.13    18193
Ceph.15    18043
Venom.13    17665
P.tep.10    17525
Env.Cricket.1    17518
S.g.M.9    17274
Fat.14    16628
S.g.F.7    16443
S.g.F.6    16432
Fat.13    16336
Ovary.14    15214
Zymo.Mock.Community    13874
P.tep.11    12485
S.g.F.8    12305
Ext.Ctrl.5.19    1194
SSC.Buf.3.9    1011
Ext.Ctrl.5.13    688
Ext.Ctrl.7.12    590

Total number seqs written    896429

I am curious what you think of the total number of quality reads I obtained - I think its pretty good >10K sequences per sample and I expected less for my controls at the end.

Thank you,

Sara

TonyWalters

unread,
Dec 20, 2016, 11:29:58 AM12/20/16
to Qiime 1 Forum
Hello Sara,

Those are reasonable numbers. But these reads are not oriented, correct? And they still have sequencing constructs in them?

If these aren't oriented, I'm thinking you still have more usable reads, if we get the orientation sorted out. Can you run extract_barcodes.py as above, with --attempt_read_reorientation included during the extract_barcodes.py step (and some other parameters removed), e.g.:
extract_barcodes.py --input_type barcode_paired_end -f Full_Read1_wBarcodes.fastq -r Full_Read2_wBarcodes.fastq -m Pilot_QIIME_MapFile_2.txt  --bc1_len 8 --bc2_len 8 -o parsed_barcodes_fullreads_oriented

There is a custom script to get the more frequently occurring sequences in the resulting barcodes fastq file, here:
https://gist.github.com/walterst/7a1e8608b82c5ce580c9

You'd need to download it, unzip it, and put it in the folder where your barcodes fastq file is and run it using this command (replacing files as appropriate):
python get_barcode_freqs.py barcodes.fastq barcode_frequences.txt 16

This can help to figure out how the barcodes compare to those in your mapping file.

Sara Dunaj

unread,
Dec 22, 2016, 4:41:02 PM12/22/16
to Qiime 1 Forum
Hi Tony,

Thank you for your help with this. I have had issues when running the extract barcodes script with the --attempt_read_reorientation parameter, as most of my reads do not re-orientate and there is not much to move forward with. I have attached the screen shot of the output files when running this script with this parameter on the full data set. In my test run, the new joined reads do not have the sequencing construct. I have not had success checking my full read output files, as my system can't open the file in my text editor to check. I tried grep -c but without luck ( I keep getting 0 on my non-joined/ original test set where I know the construct sequence is there). Could I use the option to check

Also, I was hoping to better understand how your custom script works and how it will help my data. I am unclear on its use. I am also hoping to run USearch for OTU picking next.

Thank you very much,

Sara
Screen Shot 2016-12-21 at 3.11.14 PM.png

TonyWalters

unread,
Dec 22, 2016, 5:08:32 PM12/22/16
to Qiime 1 Forum
Hello Sara,

The purpose of the custom script is to find the most frequently occurring sequences. This can help us to determine if the barcode orientation(s) in your mapping file, for instance, do not correspond to the barcodes of the reads when they get oriented correctly.

But, first things first-let's try to resolve why the reads won't orient. My suspicion is that there is something amiss with the LinkerPrimerSequence and/or ReversePrimer values in your mapping file (these should both be 5'->3' in orientation). Can you open some of the reads that failed to orient, and list the LinkerPrimerSequence/ReversePrimer sequences as they are shown in your mapping file so we can what is keeping the extract_barcodes.py script from properly detecting the orientations? 

Sara Dunaj

unread,
Dec 27, 2016, 4:11:02 PM12/27/16
to Qiime 1 Forum
Hi Tony,

Thank you for your help with this and explaining how the custom script you provided works. I have run that script on my full reads (without the orientation command used prior) and I attached the output with the suffix nonorient. I also ran it on the output from the extract barcode script output with the reverse orientation parameter. I also attached this text file.

In my mapping file my LinkerPrimerSequence and ReversePrimer values are in the 5' -> 3' orientation. I have also attached this file - the reverse primer sequence does not contain the linker and pad -only the reverse primer sequence. (FYI I removed my sample descriptions).  I attached some (10,000 lines) of reads that failed to orient as well.

Thank you for your help with this. I really appreciate it.
Sara
barcode_frequences_nonorient.txt
barcode_frequences_oriented.txt
Trunc_reads1_not_oriented.fastq
Trunc_reads2_not_oriented.fastq
Pilot_QIIME_MapFile_nodescrip.txt

TonyWalters

unread,
Dec 27, 2016, 5:25:57 PM12/27/16
to Qiime 1 Forum
Hello Sara,

I've done some digging in your sequences. Many of the non-oriented reads are PhiX reads from doing NCBI blast searches with them, so those aren't usable reads.

When I found a read that did hit something that looked like genuine bacterial 16S data, I looked at what potentially should be the primer sequences (or at least what comes after the first 8 bases, which should be your barcodes). Here is an example, with green indicating the barcode segment, and red/purple indicating potential primers:
@M01032:441:000000000-ATD8V:1:1101:10205:2684 2:N:0:0
GATTGGACGCTGGCGGCAGGCCTAACACATGCAAGTCGGACGGTAGCACAGGGGAGCTTGCTCCCCGGGGGACGAGTGGCGGACGGGGGAGTAATGGCTGGGAAAATGCCCGGGGGAGGGGGATAACCACTGGAAACGGCAGCTCCTACCGCATAACGTCTTCGGCCCAAAGTGGGGGATCTTGGGGCCTTACGCCAACGGCGGGGCCCCGATGGGGCGAGCTGGGAGGAGGGGCGACGGCGCCCCCCGGC

Forward primer?
GCTGGCGGCAGGCCTAACACATGCAAG




@M01032:441:000000000-ATD8V:1:1101:10205:2684 1:N:0:0
CTGGACCGTGTCTCAGTTCCAGTGTGGCTGGTCATCCTCTCAGACCAGCTAGGGATCGTCGCCTAGGTGAGCCGTTACCCCACCTACTAGCTAATCCCACCCGGGCACCTCCGGTGGCGTGAGGCCCGTAGATCCCCCCCTTTGGTCCGAAGACGTTATGCGGTATTAGCTACCGTTTCCAGCCGTTATCCCCCTCCACCGGGCAGATTCCCCGCCATTACTCACCCCTACGCCCCTCGTCCCCCGGGGAC

Reverse primer?
TGTCTCAGTTCCAGTGTG
http://www.nature.com/ismej/journal/v3/n5/full/ismej20094a.html

These don't look like those in your mapping file (I checked the reverse complement too). Did the sequencing center perhaps use a pool of primers, rather than a single (if degenerate) batch? Or maybe this was a run with other peoples' samples/primers being shared with yours, and some of those reads ended up in your data? The 8 base pair barcodes didn't match those in your mapping file, but I wasn't sure if you attached the full list of all barcodes.

It may be quite a bit of a mess to sort out, and getting the per-sample fastq files from the sequencing center may be preferable at this point.

Sara Dunaj

unread,
Dec 28, 2016, 1:21:56 PM12/28/16
to Qiime 1 Forum
Hi Tony,

Thank you for checking on these reads for me. I expected some to be from the PhiX spike in. I listed all the barcode sequences that I used. I made the sample library myself - there were no primer pools (all samples libraries were generated individually). I normalized the barcoded samples and then pooled them - then sent off to be sequenced by itself with custom sequencing primers to match my primer sequences. I am uncertain where those other primer sequences came from - I do not recognize them.

I am curious though - the non-oriented reads would no longer have the barcode sequences you highlighted correct? Wouldn't they have been extracted into the non-oriented barcode file? I just want to make sure I am understanding these pre-processing scripts.

Another question - could I use the full reads processed without orientation - wouldn't only the sequences/ reads associated with my barcodes be pulled out by the split libraries script?

I do have the demultiplexed files - not sure how to best to pre-process them. Is the split library script for demultilpexed reads just do quality filtering and put them in a format for downstream analysis?

Thank you,

Sara

TonyWalters

unread,
Dec 28, 2016, 1:43:58 PM12/28/16
to Qiime 1 Forum
For the non-oriented reads, yes, the barcodes would have been in a separate file-still it looks like other primers are in those reads that don't look like yours.

Yes, you could use the reads that are processed without orientation, I was just thinking that you may be throwing out usable data, hence the further delving into the non-oriented data.

For the demultiplexed files, I would try the multiple_join_paired_ends.py script first (make sure this gives reasonable results), and the multiple_split_libraries_fastq.py script on the resulting folder of folders (probably will have to get rid of the unjoined fastq files, and set options to not use the filename, but rather the folder names, but can cross that bridge if you want to go down that road, see http://qiime.org/scripts/multiple_split_libraries_fastq.html).

-Tony

Sara Dunaj

unread,
Jan 4, 2017, 10:09:43 PM1/4/17
to Qiime 1 Forum
Hi Tony,

Thank you for letting me know. I have no idea how there are other primers in the sample set but I am concerned that you found some. I am also trying to run the de-multiplexed files (see notes below)...

MacQIIME Sara-Dunajs-MacBook-Pro:Seq_Site_Demultiplexed_Reads_Pilot $ multiple_join_paired_ends.py -i Seq_Site_DeMulti_Fastq/ -o Multiple_Joined_output_folder

Traceback (most recent call last):
  File "/macqiime/anaconda/bin/multiple_join_paired_ends.py", line 201, in <module>
    main()
  File "/macqiime/anaconda/bin/multiple_join_paired_ends.py", line 180, in main
    match_barcodes, barcode_indicator)
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/workflow/preprocess.py", line 200, in get_pairs
    "and read2_indicator parameters as well.")
ValueError: Invalid filename found for splitting on input for file /Users/Sara_Jeanne/Desktop/QIIME/Seq_Site_Demultiplexed_Reads_Pilot/Seq_Site_DeMulti_Fastq/7650_9872_41706_ATD8V_201608_Ceph_13_ATAGCGCT_ACGACGTG_R1.fastq, check input read1_indicator and read2_indicator parameters as well.

If I can get this to work - do you think this could remove any potential primers / incorrect reads from the data set? I would hate to include incorrect/ artifact sequences in my dataset. Do you recommend any scripts or ideas for finding and removing these from my data set?

Thank you,

Sara

TonyWalters

unread,
Jan 5, 2017, 2:33:08 AM1/5/17
to Qiime 1 Forum
Sara,

These scripts won't specifically look for artifacts. There is a way with a custom script later to search specifically for your primers in the reads, cut out everything before the primer and after the reverse primer. But you may have to manually investigate some of these already demultiplexed reads to see if the sequencing center removed the artifacts from the reads.

You'll have to specify the read1_indicator and read2_indicator values for multiple_join_paired_ends.py. See initial description on this page: http://qiime.org/scripts/multiple_join_paired_ends.html
It looks like you will use --read1_indicator _R1 and --read2_indicator _R2




Reply all
Reply to author
Forward
0 new messages