Two barcode demultiplexing and primer trimming of Ilumina reads

2,024 views
Skip to first unread message

pmb0010

unread,
Mar 8, 2013, 4:38:04 PM3/8/13
to qiime...@googlegroups.com
Hi All-

I am fairly new to the bioinformatics world.  I am attempting to use Qiime for data analysis of marine metagenomic samples.  I have a couple of issues that I am trying to see if there are solutions to. 

1).  First I was wondering whether or not Qiime could demultiplex samples with duel barcodes?  I have amplicon sequences that were run in a lane on a Illumina HiSeq that have a different forward and reverse sequence barcode on them.  Forward and reverse reads have been combined into a single read already (sequences have been overlapped).  The issue is several samples have the same forward barcode and several samples have the same reverse barcode.  It is the forward and reverse barcode combination that distinguishes the samples apart from one another.  I have demultiplexed Illumina sequences using a single barcode on previous samples, but could not find any information on whether or not it is possible to demultiplex samples with two barcodes and methods to go about setting up the mapping files in Qiime. 


2).  Second, is there any way to trim primer sequences off of Illumina reads in Qiime?  I see there is an option in the split library script for 454 reads but it is not there for Illumina reads.  I would like to trim the primers off my sequences before I start any data analysis. 

Thank you in advance for any assistance! 

Laura Wegener Parfrey

unread,
Mar 8, 2013, 4:52:21 PM3/8/13
to qiime...@googlegroups.com
Hi Pamela,
Qiime does not have any provisions in place to demultiplex with 2 barcodes. As this is not a common scenario you will likely need to write a custom solution.

2) Illumina reads don't generally contain primers so this option is not included in split_libraries_fastq.py.  One option would be to convert your fastq to fasta + qual files with convert_fastaqual_fastq.py and then you could use split_libraries.py. 

Laura

--
 
---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Laura Wegener Parfrey
Postdoctoral Research Associate
University of Colorado
Boulder, CO 80309

michael morando

unread,
Aug 20, 2013, 11:52:45 AM8/20/13
to qiime...@googlegroups.com
Hello,

Has any resolution come to this problem?  I have some what of a similar issue.  I added a 5bp barcode to my forward reads and a 6bp index to my reverse read so that I could use less barcodes/indexes and still have 128 unique ids, i.e. 12 different reverse indexes and 9 different forward barcode for a total of 128 combinations.  I used usearch (merge_fastq)  to stitch these files (there was an ~90bp overlap between reads) and then used the split_libraries_fastq.py function to first demultiplex the indexes by the reverse read.  So now I have one file that has the reads split by index.  I know want to further demulitplex based on the forward barcodes.  I tried using:

split_libraries_fastq.py -i seqs.fna -o /stich_Din_Dbar -m first_run.mapping_BC_corrected.txt --store_qual_scores -r 1000 -n 1000 -q 0 --max_barcode_errors 0 --barcode_type '5' --retain_unassigned_reads
the -r and -n are so high bc I don't want any sequences thrown out on this step unless they do not completely match the barcode.

the error msg I recieve is:

Demultiplex and quality filter (at Phred Q20) two lanes of Illumina fastq data and write results to ./slout_q20.:
 split_libraries_fastq.py -i lane1_read1.fastq.gz,lane2_read1.fastq.gz -b lane1_barcode.fastq.gz,lane2_barcode.fastq.gz --rev_comp_mapping_barcodes -o slout_q20/ -m map.txt,map.txt --store_qual_scores -q20

Quality filter (at Phred Q20) one non-multiplexed lane of Illumina fastq data and write results to ./slout_single_sample_q20.:
 split_libraries_fastq.py -i lane1_read1.fastq.gz --sample_id my.sample -o slout_single_sample_q20/ -m map_not_multiplexed.txt -q20 --barcode_type 'not-barcoded'

Quality filter (at Phred Q20) two non-multiplexed lanes of Illumina fastq data and write results to ./slout_single_sample_q20.:
 split_libraries_fastq.py -i lane1_read1.fastq.gz,lane2_read1.fastq.gz --sample_id my.sample -o slout_single_sample_q20/ -m map_not_multiplexed.txt -q20 --barcode_type 'not-barcoded'

is there a way to demultiplex off this forward barcode?  QIIME should see it just as a 410bp read with a barcode as the first 5 bases.  I also think part of the problem may be that my mapping file has the same barcodes repeated many times, which I have uploaded.  This should not be an issue for determine which sample is which as that in combination with the demultiplexed index should be enough information.  Please help.  Others at my universtiy are having the same issue!

Thanks
first_run.mapping_BC_corrected.txt

michael morando

unread,
Aug 20, 2013, 12:30:39 PM8/20/13
to qiime...@googlegroups.com
I just realized that this could also be because I am using the output of my first demulitplexing which is an .fna file.  I do not think this is a fastq file but a fasta.  Perhaps if I somehow combine the seqs.fna file with the seqs.qual file and produce a fastq this would work?  sorry for the second post.

Tony Walters

unread,
Aug 20, 2013, 1:25:18 PM8/20/13
to qiime...@googlegroups.com
Hello Michael,

The support for paired-end barcodes hasn't been officially implemented in QIIME. I've got a work-around, described below, although you'll have to do some manual processing for stitched paired end data.

To use paired end barcodes with single-read (not stitched) data:
1. Create fastq barcode reads files if you don't have them already*. I've put up a script, parse_bc_reads_labels.py, on github here that you can download with the "download gist" button on the left:
https://gist.github.com/walterst/6028738
The page lists the usage. With this script you will create an output barcodes fastq file, a reads fastq (with barcode removed).
2. Repeat the same process on the second reads to create the barcode/reads from the other end.
3. Combine the barcodes using the combined_fastq_barcodes.py script, found here: https://gist.github.com/walterst/6284164
4. Alter your mapping file to have barcodes that are a combined version of the first reads (created in step 1) and the second reads (step 2), so if your first read barcode was ATCCG and the second read barcode was CCGAAT, then your BarcodeSequence in the mapping file would be ATCCGCCGAAT. You want to make sure all of your final barcodes are unique and the mapping file doesn't have any errors when you run check_id_map.py on it.
5. Use the combined barcodes file, fixed mapping file, and one of the two reads with split_libraries_fastq.py with the added parameter to match your barcode length. In your case, this would be --barcode_type 11

You have to additional filtering for stitched reads. This is because during the stitching process, some of the reads will fail to be stitched, and therefore will be  subset of the combined barcodes created above. split_libraries_fastq.py expects the barcode read and the sequence reads to be an exact match (loading the sequences into memory to allow subsets of matches wouldn't be feasible in some cases on Illumina datasets).

After step 3, you would want to filter the combined barcodes. If the labels haven't been changed by the stitching process, you should be able to use these commands:
egrep '^@' X| tr -d '@' > seqs_to_keep.txt
filter_fasta.py -i Y -o stitched_barcodes.fastq -s seqs_to_keep.txt
where X is the stitched fastq file, and Y is the combined barcodes fastq file.

If the labels were changed, then you might have to do more to parse them, Nanelle wrote up a process for handling PandaSeq combined reads here: https://groups.google.com/forum/?fromgroups=#!topic/qiime-forum/CO9EmR4FH58

I hope this helps,
Tony

James Angus Chandler

unread,
Mar 21, 2014, 5:23:21 PM3/21/14
to qiime...@googlegroups.com
Hi Tony,

We have paired-end barcodes on forward and reverse reads. We have four fastq files: Forward Read; Forward Barcode, Reverse Read, and Reverse Barcode. I think I understand your workaround for the paired-end barcodes, but then there is the problem of the stitching of reads. I think the -b option of the join_paired_ends.py script will solve the problem. How does this workflow sound?

1. Step 3 below: "Combine the barcodes using the combined_fastq_barcodes.py script, found here: https://gist.github.com/walterst/6284164"
2. Run join_paired_ends.py with -b to the file just created in step 1 above
3. Steps 4 and 5 below.

Thanks,

Angus

Tony Walters

unread,
Mar 21, 2014, 5:36:28 PM3/21/14
to qiime...@googlegroups.com
Hello James,

A more formal script was implemented in QIIME 1.8.0 (http://qiime.org/scripts/extract_barcodes.html), which might be of help here. The third example for extracting barcodes from paired reads could be used for the forward barcode and reverse barcode reads to make a combined barcode read. This resulting barcode fastq file could be fed into the join_paired_ends.py script as the -b parameter to filter out the barcodes so they will match the resulting stitched reads. Then the stitched reads and the filtered barcodes could be used with split_libraries_fastq.py to do the quality filtering/demultiplexing.

-Tony


For more options, visit https://groups.google.com/d/optout.

Jenna Morgan Lang

unread,
Mar 24, 2014, 3:29:21 PM3/24/14
to qiime...@googlegroups.com
Hi Tony,

So, I'm trying to run through this as you've suggested here. I'm getting stuck when trying to join the reads.

Here's my IPython notebook for task.
Here are the original sequence files.

When doing this:
join_paired_ends.py -f $PWD/forward_reads.fastq -r $PWD/reverse_reads.fastq -b $PWD/barcodes.fastq -o $PWD/fastq-join_joined

I get this:
Traceback (most recent call last):
  File "/macqiime/QIIME/bin/join_paired_ends.py", line 168, in <module>
    main()
  File "/macqiime/QIIME/bin/join_paired_ends.py", line 164, in main
    write_synced_barcodes_fastq(assembly_fp,index_reads)
  File "/macqiime/lib/python2.7/site-packages/qiime/join_paired_ends.py", line 68, in write_synced_barcodes_fastq
    " paired-end ID processed was:\n\'%s\'\n" %(joined_label)
StopIteration: 

Reached end of index-reads file before iterating through joined paired-end-reads file! Except for missing paired-end reads that did not survive assembly, your index and paired-end reads files must be in the same order! Also, check that the index-reads and paired-end reads have identical headers. The last joined paired-end ID processed was:
'HWI-M02034:34:000000000-A7U20:1:1101:15638:1444 1:N:0:'

Thank you!
Jenna

Tony Walters

unread,
Mar 24, 2014, 3:44:52 PM3/24/14
to qiime...@googlegroups.com
Jenna, can you post the beginning of each of the input files?

E.g.
head $PWD/forward_reads.fastq
head $PWD/reverse_reads.fastq
head $PWD/barcodes.fastq

Jenna Morgan Lang

unread,
Mar 24, 2014, 3:50:14 PM3/24/14
to qiime...@googlegroups.com
Well, they're gzipped files. Is that my problem? extract_barcodes.py didn't seem to mind them, but maybe fastq-join is unhappy about them?


--

---
You received this message because you are subscribed to a topic in the Google Groups "Qiime Forum" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/qiime-forum/0jgI9zC__NQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to qiime-forum...@googlegroups.com.

Tony Walters

unread,
Mar 24, 2014, 3:57:33 PM3/24/14
to qiime...@googlegroups.com
I think the script should handle the gzipped format (most of the scripts that deal with fastq files should be able to handle input fastq or gzipped fastq data).

Can you unzip the input files in this case? I want to see if there is an issues with the read1/read2 headers or if something is awry with the barcodes header (or if we should look elsewhere for the source of the problem).

Jenna Morgan Lang

unread,
Mar 24, 2014, 4:01:48 PM3/24/14
to qiime...@googlegroups.com
Here ya go! Thanks for the help!

Forward:

@HWI-M02034:34:000000000-A7U20:1:1101:14765:1433 1:N:0:
TACGGAGGGTGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGTTTTTTAAGTTAGCTGTGAAATCCCGAGGCTCAACCTCTTAACTGCCTTTAAAACTCCTGGTCTTGAGTTCTGGAGAGGTGAGTGGAATTCCAAGTGTAGCGGTGAAATTCGTAGATATTTGGAGGAACCCCATTGGCGAAGGCGCCTCCCTGGCTCGATACTGCCGCTGAGGTACGAAAGTTTGGGTAGCAACC
+
>1>111111111131BAEEAAAG0000BB11FDE/A/A///DBAA/AEBCCED??//////?E1B@@2111FFBFBG11B////>GGFBGAGB11121B1BBF1B1B11<F111@01@@@1>1?111111?<</F/F0>D.11<>1<>11>1=D################################################################################################
@HWI-M02034:34:000000000-A7U20:1:1101:15638:1444 1:N:0:
TACGGAGGGTCCGAGCGTTATCCGGAATTATTGGGTTTAAAGGGTCCGTAGGCGGATGATTAAGTCAGGGGTGAAAGTTTGCAGCTCAACTGTAAAATTGCCTTTGATACTGGTCATCTTGAGTTGTATTGAAGTAGGCGGAATATGTAGTGTAGCGGTGAAATGCATAGATATTACATAGAACACCAATTGCGAAGGCAGCTTACTAAGTACTAACTGACGCTGATGGACGAAAGCGTGGGTAGCGAAC
+
>1>11111@A1AA00AAEE0FEG?0//BF1FG11/AFE12111/BGA/F0//1/E/>/1BB211FF2100/E?/11>BBFDF21BGF<CGEBGD2B2FGFFGHHH1DF2FGF1FF2DGHHB21GH1F1@F221@@F1BD@?-<>1>1>=FGBGH0D--CC.0D=0;0CBG0;CGC0;CGF0;0:/:C?.CF0BA-;-.A..;CF0FF00;90;FB/BBB/;-ABB-FF######################
@HWI-M02034:34:000000000-A7U20:1:1101:16853:1449 1:N:0:
TACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGTCTTTTTAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGATGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGACT

Reverse:
@HWI-M02034:34:000000000-A7U20:1:1101:14765:1433 4:N:0:
CCTTTTTGCTCCCCTCACTTTCGTACCTCATCTTCATTCTCTTTCCTTTTCGCTGCCTTCTCCATTGTTGTTCTTCCTCATCTCTACGCCTTTCTCCTCTCCACTTCGTTTTCCACTCCCCTCTCTCCTACTCTCGTCCACCAGTTTTACATTCCTTTCCCACTTTTCTCCCCTGTCTTTCTCCCCTTTCTTTCTTCTCCGCCTACGTTCGCTTTTCGCCCCGTTCTTCCTCTTCTCTCTTTCCCCCTCC
+
>1>1>B111B1@AA1A1AB1GDA00AAFA11A2B22222D2221A112B12////AAAFG1BB1121220BAA2BAA111D1DBF21///B?FG2B@1FG0>1B1@1/00??BF2@B00?/>EFG11011BBF21/////10/0B>B22122>>222@1111001@121<10/0/11<<<<1?10<<01==D##########################################################
@HWI-M02034:34:000000000-A7U20:1:1101:15638:1444 4:N:0:
CCTTTTCGCTTCCCTCGCTTTCGTCCCTCAGCGTCCTTTAGTACTTCGTAAGCTGCCTTCGCAATTGGTGTTCTATGTAATATCTATGCATTTCACCGCTACACTACATATTCCGCCTACTTCAATACAACTCACGCTGACCAGTATCAAAGGCAATTTTACAGTTGAGCTGCAAACTTTCACCCCTGACTTAATCATCCGCCTACGGACCCTTTAAACCCAATCATTCCTGATACCGCTCGCACCCTCC
+
>>>1>B@1>A1>AA1B1AFCFGEHEA0BE01A0BA//2A11A22DD1/B00BBF1B0BFG/A//A@B1@/B>2F2E2@22B2BFF2@212BFFE2B1/>E/>BBF0>1>2BFF>0//?/BBD121>2111BFG11/0?//0010@1@D22110/00??111?11FD<111111<11>DGFDD=<0<E./<DD##########################################################
@HWI-M02034:34:000000000-A7U20:1:1101:16853:1449 4:N:0:
CCCTTTCTCTCCCCTCGCTTTCGTCTCTCAGTTTCATTGTCGGCCCAGCAGAGTTCTTTCTCCGTTGTTGTTCTTTCCGATCTCTACGCATTTCCCCGCTCCACCGGATTTTCCCTCTGCCCCTACCGTACTCCAGCTTGGTAGTTTCCATCGCCTGTCCAGGGTTGAGCCCTGGGCTTTGACGGCTGACTTTACAAGCCACCTACTGACGCTTTACGCCCAATCATTCCTGATAACGCTTGCATCCTCT

Barcodes:
@HWI-M02034:34:000000000-A7U20:1:1101:14765:1433 2:N:0:
CGCCTTAGAACCAGTC
+
1111>11111>111B1
@HWI-M02034:34:000000000-A7U20:1:1101:15638:1444 2:N:0:
CTCCGTTCTTGGCTAT
+
1AA>AAF1AA33>FBF
@HWI-M02034:34:000000000-A7U20:1:1101:16853:1449 2:N:0:
TACTGTTTTTCAGCGA




On Mon, Mar 24, 2014 at 12:44 PM, Tony Walters <william....@gmail.com> wrote:

--

---
You received this message because you are subscribed to a topic in the Google Groups "Qiime Forum" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/qiime-forum/0jgI9zC__NQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to qiime-forum...@googlegroups.com.

Tony Walters

unread,
Mar 24, 2014, 4:45:55 PM3/24/14
to qiime...@googlegroups.com
Hmmm, well the labels look fine, and the one that it stopped on was the second one in each of the reads/barcodes file above (HWI-M02034:34:000000000-A7U20:1:1101:15638:1444). I don't think the trailing read number (e.g. 2:N:0:) should affect the stitching process. Can you try calling the join_paired_ends.py without the barcodes (-b) parameter and see if it completes?
May ask for a slightly larger sample reads (the faux reads/barcodes that I created from the email didn't throw an error for me), can you create larger versions of the input files:
head -n 20 $PWD/forward_reads.fastq > test_forward_reads.fastq
head -n 20 $PWD/reverse_reads.fastq > test_reverse_reads.fastq
head -n 20 $PWD/barcodes.fastq > test_barcodes.fastq
and attach those files?

James Angus Chandler

unread,
Mar 24, 2014, 5:54:05 PM3/24/14
to qiime...@googlegroups.com
Hey Tony and Jenna,

I am working on a very similar dataset as Jenna (same format, but different run) and I was getting the same error as described above:


Reached end of index-reads file before iterating through joined paired-end-reads file! Except for missing paired-end reads that did not survive assembly, your index and paired-end reads files must be in the same order! Also, check that the index-reads and paired-end reads have identical headers. The last joined paired-end ID processed was:
'MISEQM00958:38:000000000-A6BGH:1:1101:15488:1333 1:N:0:'

I deleted the trailing " 1:N:0:" from the forward read, the " 4:N:0:" from the reverse read, and the " 2:N:0:" from the barcode and it works now.

Hope this helps,

Angus

Jenna Morgan Lang

unread,
Mar 24, 2014, 6:12:21 PM3/24/14
to qiime...@googlegroups.com
Aha! So, Tony, do you still want more information/files from me, or does this solve the mystery?

I was able to use join_paired_ends.py without the (-b) parameter, and it seems to have worked fine.

Thanks, Angus!



Tony Walters

unread,
Mar 24, 2014, 6:18:17 PM3/24/14
to qiime...@googlegroups.com
Thanks James,

I think James has the answer (I'm checking to see if this issue is already resolved in the development version; we may need to open a ticket about this).

You might try a sed command to create filtered versions of the files (unless James has a more efficient approach), e.g.
sed  's/ 1:N:0://g' $PWD/forward_reads.fastq > $PWD/forward_reads_filtered.fastq
sed  's/ 4:N:0://g' $PWD/reverse_reads.fastq > $PWD/reverse_reads_filtered.fastq
sed  's/ 2:N:0://g' $PWD/barcodes.fastq > $PWD/barcodes_filtered.fastq

You can use head or less to browse the _filtered files to see if it removed the trailing run numbers, and then try passing in the _filtered files (including the barcodes file) as input and see if that resolves the issue for you.

-Tony

James Angus Chandler

unread,
Mar 25, 2014, 4:38:37 PM3/25/14
to qiime...@googlegroups.com
Hey Tony,

So the joining seems to be working fine, but now I am encountering a problem with the split_libraries_fastq.py step. Specifically, all the reads are "Read too short after quality truncation" which means none are left for analysis. I played with the -r and -p commands as suggested here https://groups.google.com/d/msg/qiime-forum/_yPbstZPc6Y/hnYuUwAkLAkJ. With an r of 9, half the reads are removed and it wasn't until an r of 12 that most the reads passed. I also noticed that the number of passed reads abruptly increases going from an r of 11 (75% passed) to an r of 12 (95% passed). Given that the default is 3, is it reasonable to increase r so much to get this data?

Also, adjusting p only slightly changed the results.

Thanks,

Angus (AKA James)

Tony Walters

unread,
Mar 25, 2014, 5:06:49 PM3/25/14
to qiime...@googlegroups.com
Hello James,

The low quality in the center of the reads does create some issues with the split_libraries_fastq script (especially since the truncation is no longer as the expected end of the read, but rather in the middle, creating an apparently short read). In this case, increasing the -r parameter does seem reasonable, as the overlapping region of low quality would have to be wrong in both the forward and reverse read (or happen to have a higher quality score for an incorrect base call in the case of a mismatch). At least that's my thinking on the issue, but other opinions are welcome.
Reply all
Reply to author
Forward
0 new messages