extract_barcodes.py command parameters selection


Umar Sohail

Nov 18, 2017, 1:27:06 AM
to Qiime 1 Forum


Dear colleagues,
I have R1.fastq and R2.fastq files from a MiSeq run targeting the V4 region. The mapping file says the barcode length is 8, but I do not know which values to use for --bc1_len and --bc2_len. When I used 8 for each of them, I got almost 50% of the total sequences (Total number of input sequences: 1028678, Barcode not in mapping file: 674052).
When I used 6 for each of them, I got only 1% of the sequences (Total number of input sequences: 1028678, Barcode not in mapping file: 1026387).
I do not know how to read FASTQ files.
I have copied the first few lines of the R1 and R2 files, and the mapping file, below. Could you please help me build the extract_barcodes.py command? Should I also reorient the sequences?

mapping file

#SampleID BarcodeSequence LinkerPrimerSequence BarcodeName ProjectName Description
HSB.1 GAGAGTGT GTGCCAGCMGCCGCGGTAA 515Fbar1 091317MJ515F HSB.1
HSB.10 GAGTCACT GTGCCAGCMGCCGCGGTAA 515Fbar10 091317MJ515F HSB.10
HSB.11 GAGTCAGA GTGCCAGCMGCCGCGGTAA 515Fbar11 091317MJ515F HSB.11
HSB.12 GAGTCTCA GTGCCAGCMGCCGCGGTAA 515Fbar12 091317MJ515F HSB.12
HSB.2 GAGATCAG GTGCCAGCMGCCGCGGTAA 515Fbar2 091317MJ515F HSB.2
HSB.3 GAGATCTC GTGCCAGCMGCCGCGGTAA 515Fbar3 091317MJ515F HSB.3
HSB.4 GAGATGAC GTGCCAGCMGCCGCGGTAA 515Fbar4 091317MJ515F HSB.4
HSB.5 GAGATGTG GTGCCAGCMGCCGCGGTAA 515Fbar5 091317MJ515F HSB.5
HSB.6 GAGTACAG GTGCCAGCMGCCGCGGTAA 515Fbar6 091317MJ515F HSB.6
HSB.7 GAGTACTC GTGCCAGCMGCCGCGGTAA 515Fbar7 091317MJ515F HSB.7
HSB.8 GAGTAGAC GTGCCAGCMGCCGCGGTAA 515Fbar8 091317MJ515F HSB.8
HSB.9 GAGTAGTG GTGCCAGCMGCCGCGGTAA 515Fbar9 091317MJ515F HSB.9


R1
@D00420:169:HVM3LBCXY:1:1101:1396:2224 1:N:0:GGCTAC
GAGTCACTGTGCCAGCAGCCGCGGTAATACGTAGGGTGCTAGCGTTATCCGGATTTACTGGACGTAAAGGGTGCGTAGGTGGTCTTTCAAGTCGGTGGTTAAAGGCTACGGCTCAACCGTATTAAGCCGCCGAAACTGGAAGACTTGAGTGCAGGAGAGGAAAGTGGAATTCTCAGTGTAGCGGTGAAATGCGTAGATATTGGGAAGAACACCAGTAGCGAAGGCGGCTTTCTGGACTGCAACTGACACTG
+
GGGGAGGGGGIIIIIIGGGGGI<AGGGGGGGGGGGGGAG<GGAAGGGGGG.GGG.<AAGAGGGGGGGGGGGGGIGGGIAGGIIGGGGGAAG<A<AAGGAGGA<<AGGIGG<GAGGGGGA<GGGGGGGAAAAGG.AAGGGGAGGGGGGGGGGGA.A.GA<AGAGG.7..A.GGGIGGGGGGGIIIGAGGGGGGG77AGGAGIIGGGGGAGGGGGAAGGGIGGI.<GGAGGGGGGGGGIIGIIIIIAAGGGGA
@D00420:169:HVM3LBCXY:1:1101:1553:2151 1:N:0:GGCTAC
GGACTACAGGGGTATCTAATCCTATTTGCTCCCCACGCTTTCGTGCTTGAGCGTCAGTATCAGTCCAGGCAACCGCCTTCGCCACTGGTGTTCCTCCATATATCTACGCATTTTACCGCTACACATGGAATTCCATTGCCCTCTCCTGTACTCTAGCCTACCAGTATCTATGGCTATATGGGGTTAAGCCCCACGCTTTCACCACAAACTTAATGGGCCGCCTACGCACCCTTTACGCCCAATAATACCGG
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIGIIIIIIIIIIGGIIIIIIIIIIIIIIIIIIIGGIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIGIIGIIIIIIIIIIIGIIIIIGGIGGGIIGIIIIIIGGGGGIIIIGGGIIGGGIIIIIIIIIIGGGGGGGIIGGGGGGGIGAGGAGGIIGGGGGGGIIGGGGGGGIGAGGGGGIIG<
@D00420:169:HVM3LBCXY:1:1101:2313:2121 1:N:0:GGCTAA
GGACTACGGGGGTATCTAATCCTGTTTGATCCCCACGCTTTCGAGCCTCAGTGTCAGTTACAGTCTAGTGAGCTGCCTTCGCAACTGGTGTTCTTCGTTATATCTAAGCATTTCACCGCTACACCACGAATTCCGCTCTCCTCTACTGTACTCAAGACTAACAGTATCAAATGCAATTTTTGGTTGAGCCCCCACCCTTTCCCCCCGACCTTATCAACCACCCTCGCTCGCCTTTCACCCCATTAATCCGG
+
GGGGGIIIIIIIIIIIIIIIIIIIIIII.G.GGGII.GGIIIIIII.GGIIIIIIIII.GGII<G.GG..<GG.GG.GGIII.A..<G.GGGI.GG...<GGGIII.GGGIIIIIIIIIIIIII...<AGIIII.GG..<GGII.GGG.GGIIGGG.GG..GGGI.GG.G.<GGG.......<.<....A.....7.<GG....7...7.7.A..7.7.77G....7...7.7A....7G7G..AG..7.7
@D00420:169:HVM3LBCXY:1:1101:2583:2197 1:N:0:GGCTAC
GAGTACAGGTGCCAGCAGCCGCGGTAATACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTAATAAGTCTGAAGTTAAAGGCAGTGGCTTAACCATTGTTCGCTTTGGAAACTGTTAGACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTG
+
GGAGGGGIGGGIGGGIIGIIIIIGIIIGGGIGGGGIGGGGGG<GAAGGGGGIAGGIIGGGGIIIIIIIIIIIGGIIIGGGIIA.AGGGGIGIIIIIGGGIGGGGIGGGGGGGGGGGAGGGIIGGGGIIGGIGIIGIGGGGGGIIGGGIIIIAGGGIIGAGIIIIAGIIIIGGGGGIIGGAGGIGAGGGGAGGGGIIGAGGGIGGIGGIGGIGGGGAGGIIIIIIGGGIIGGIGIIGGGGIIIGAGGGA.77
@D00420:169:HVM3LBCXY:1:1101:3030:2133 1:N:0:GGCTAC
GAGTCAGAGTGCCAGCAGCCGCGGTAATACGTAGGGGGCTAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTTCGCAGGCGGAAATGCAAGTCAGATGTGAAAGGCAAAGGCTCAACCTTTGTAAGCATCTGAAACTGTATTTCTTGAGATGTGGAGAGGCAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAATACCAGTGGCGAAGGCGACTTGCTGGACACAAACTGACGCTG
+
AAAGAGGIGIIIIIGIGIIIGGGGIGAGGGA.GGIGGAGGGGAGG<AAAGGGIGIIIGAG.GG<<GGGGGGAGGGGGGGGIGGGIIGGGGIGAGGGGGGGGGAAGGGGIGGGGGGGIIIGGGGIIGGGGGIG.AGGIGGGGAGAGGGGGIGGGAGGGGGGGAGGGIIIGGGGGGGGGGGGGIIGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGAGGGGGGIGGII.<G.A..7AGGGGGGGAGGGGGGG
@D00420:169:HVM3LBCXY:1:1101:3837:2158 1:N:0:GGCTAC
GAGATCAGGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGGCTGGTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGGGAAATTGCAGTTGATACTGTCAGTCTTGAGTACAGTAGAGGTTAGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCAATTGCGAAGGCAGCTGACTAGCCTGTTACTGACACT
+
<G<...GGGGIIIIIIGIGIIIIIIIIIIIGGIII.GGIAGIIIGGI..GGGI.GGIGGGII<.<GGGIAGGGGIIIGGGGI..<<.GGGGGIGIAGGIIGGG.<....<GGIAGIII...<<G.GGG..<GGGGGIGIG..<<GGAGIGGG<.A.<GAGIA<AG..<GGGAAGGGGIIIGGIIIIGGGGIII.GGGGGGIG.AGA.AGG..AG.AGAGIGIGGGIIGGGGGGAGG..7.7AGGIIIGGGA
@D00420:169:HVM3LBCXY:1:1101:6189:2220 1:N:0:GGCTAC
GTATCTAATCCTGTTTGATACCCACACTTTCGAGCATCAGCGTCAGTTACAGTCTGGTAAGCTGCCTTCGTAATCGGAGTTCTTCGTGATATCTAAGCATTTCACCGCTACACCACGAATTCCGCCTACCTCTGATGCACTCAAGACACCCAGTATCAACTGCAATTTTACGGTTGAGCCGCAAACTTTCACAGCTGACTTAAGCATCCGCCTACGCTCCCTTTAAACCCAATAAATCCGGATAACGCTCG
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGIIIIIIIIIIIIIIIIIIIIGGIIIIIIIIIIIGIIIIIGGGIIIIIIIIIIIIIIIIGIIIIIIGIIIIIGIIIIIAGGGIGGIGIGGGIGGGIIGGGGGGGGGIGGGGGG<
@D00420:169:HVM3LBCXY:1:1101:6081:2225 1:N:0:GGCTAC
GGACTACGGGGGTATCTAATCCCATTCGCTCCCCTAGCTTTCGTCTCTCAGTGTCAGTGTCGGCCCAGCAGAGTGCTTTCGCCGTTGGTGTTCTTTCCGATCTCTACGCATTTCACCGCTCCACCGGAAATTCCCTCTGCCCCTACCGTACTCCAGCTTGGTAGTTTCCACCGCCTGTCCAGGGTTGAGCCCTGGGATTTGACGGCGAACTTAAAAAGCCACCTACAGACGCTTTACGCCCAATCATTCCG
+
GGGGGIIIIIIGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIGIIIIIIIIIIIIGIIIIIIIIIIIIIIGGIIIIIIIIIIIGGIIIGGIIIIIIIIIIIIIIIIIGIIIIIIIIGIIIIIIIGGGIIGGIIIIIIIGIGGGIIIGIIIII<GGGGIIIIIGGIGGIIGIIG<GGAGIGIGGGIIIIGGIIGAGGGAGGGAGAGGGGGIGGGGGGGII.
@D00420:169:HVM3LBCXY:1:1101:6834:2116 1:N:0:GGCTAC
GAGTCAGAGTGCCAGCAGCCGCGGTAATACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGCCTGAAGTTAAAGGCTGTGGCTTAACCATAGTATGCTTTGGAAACTGTTTACCTTGAGTGCAAGAGGGGAGAGTGGAGTTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTG
+
GGGGGIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGIGIIIIIIGIIIIGGGIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIGGIIIIIIIIIIIIIIIGIGIGIGGIIGIIIIIIIIIGIGGGIIIGGIIIIIGIIIIIIIIGGGGGGGAGGGIGGGGGIIGIGGGGGGGGGGGGGGGGGGGGGGGGAAGGGGGIGGGGGGGGGGGIGIGGGGIGGIGGGGGGAAAAGGGG



R2
@D00420:169:HVM3LBCXY:1:1101:1396:2224 2:N:0:GGCTAC
CCGGGGCATCTAATCCTGTTTGCTACCCACGCTTTCGTGCCTCAGCGTCAGTTGCAGTCCAGAAACCCGCCTTCGCTACTGGTGTTCTTCCCAATATCTACGCATTTCACCGCTCCACTGAGAATTCCACTTTCCTCTCCTGCACTCAAGTCATCCAGCTTCGGCTGCTTAATCCGGTTGAGCCGTAGCCTTTAACCACCGACTTGAAAGCCCACCTACGCACCCTTTACGTCCAGTACATCCGT
+
AGGGII...<<G.GGGGGAAAGGGAGAAGGGIAAGGAG.GG.GGA.<<.<GGGGG.<<GGGAAGG..GAG.<GGAG<.<<<G.GA<A.<AGAG.<...<GGGGG...<.<G<<A.A.<A...<.GGGAAGA.<AA..<GGAAG.<AG.<.<..<.AGG.<..<<A..G7.<.G.GAA.7.7AAGGGAGAA.77GA7GAAA<<..77..........7...77AG7AAGAA......77.7.7.77
@D00420:169:HVM3LBCXY:1:1101:1553:2151 2:N:0:GGCTAC
CAGTGCCAGCCGCCGCGGTAATACGTAGGGGGCTAGCGTTATCCGGTATTATTGGGCGTAAAGGGTGCGTAGGCGGCCCATTAAGTTTGTGGTGAAAGCGTGGGGCTTAACCCCATATAGCCATAGATACTGGTAGGCTAGAGTACAGGAGAGGGCAATGGAATTCCATGTGTAGCGGTAAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGTTGCCTGGACTGATACTGACGCTG
+
IIIIIIIIIIIIIIIGIIIGIIIIIGIIIIIIIIIIIIIIIIIIIIGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIGIIIIGIIIIIGIGGGAGGIGIGIIIIIIIIIIIIGGIIGGGIIIIIIIIIIIIGIIIIGIIAGGGGIIGGGAGGIIIIIIIIIIIIIIII.GGGGIIIGGGIIGGGIGGAGGGAGAGGIGIIIIGAGGGGGGIIIGGGGIGGAGGGAGGGGAAAGGIIIGGGAG.
@D00420:169:HVM3LBCXY:1:1101:2313:2121 2:N:0:GGCTAA
NNNNNNNNGCNNNNNCGGNNATANNGNNGATNNNNNCGNTATCCGGATTTATTGGGCGTAAAGGGAGCGTAGGTGGGCTATTANNNCTGGNGNNNNNGGCAGTGGCTCAACCATAGAATGCAAGGGGATACGGTCGACCTGAGGTCAGGAGAGGAGAGTGGGAATCCGTGGGTAGGGGGGGAATGGGTAAGTAATCGGAGGAACACCGAAGGCGGAAGCGGCTTTACGACCTGGTTACTAACATT
+
########<<#####<<G##<<G##.##<..#####<<#<.<GGGGGGGIIIIIIIA.<GGII.GGGGI<GGGGIG..G..<G###<.<.#<#####<...<.<GGGIIGGI.<...<G........<..<.<G.....<........<....<......7.<.<.<.<G.......7..77G.77.7..7......7....7...7......7.7<G.......7..........7......7.
@D00420:169:HVM3LBCXY:1:1101:2583:2197 2:N:0:GGCTAC
CGGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCCCTTCTGCACTCAAGTCTAACAGTTTCCAAAGCGAACAATGGTTAAGCCACTGCCTTTAACTTCAGACTTATTAAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGT
+
GGIIIIGIGGGGGGGGIIIIIIGGGIIGIIIIIIGIIGGGGGAGGGGGAGGGGGGGAGGGGGGGGGGGGIIGGGGGIGGIIGGGAGAGGGGIIIIGGGGGGII<AGGGGGGGAGGGAGGAGGGIGGGAGGGAGGGGGIIGGIIGIGGGGGAGGAAGGGGGGAGGGGGGG<AA.7GGGGGIAGIGGGGGIGII7AGGGGGGGAGGAGGAGGIGGGGIGAG<<GGGGGAGGGGAG<GAA7AAGG.7.
@D00420:169:HVM3LBCXY:1:1101:3030:2133 2:N:0:GGCTAC
CCCGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGTTCCTCAGCGTCAGTTTGTGTCCAGCAAGTCGCCTTCGCCACTGGTATTCCTCCTANTNTCTACGCATTTCACCGCTACACTAGGAATTCCACTTGCCTCTCCACATCTCAAGAAATACAGTTTCAGATGCTTACAAAGGTTGAGCCTTTGCCTTTCACATCTGACTTGCATTTCCGCCTGCGAACCCTTTACGCCCACTAATTCCGT
+
GGGGGGIIGGGGGGIAGAGGAGGGIGAGAAGGGGGGGGGGGGGGIGGGAAGAGAAGGGAGGAAGGAGGGGGIGGGG..GGGGGGGGGIIIIIG#<#<<<GGGIGGAGGGG<.GGGIGAAAAGAGGGGG.GGAAGIIG<GGAGAAGGAGAG<GGG.GAAGGAGGAA..<<.AGG.AA.<G7.<G7GG.7.7A77..77AAGG7.AGAAG.7A.A7GGG...7.77.7AG.7..<A..77.GA....
@D00420:169:HVM3LBCXY:1:1101:3837:2158 2:N:0:GGCTAC
CGGGGGTATCTAATCCTGTTTGATACCCACACTTTCGAGCATCAGTGTCAGTAACAGTCTAGTCAGCTGCCTTCGCAATTGGAGTTCTTCGTGATATCTAAGCATTTCACCGCTACACCACGAATTCCGCTAACCTCTACTGTACTCAAGACTGCCAGTATCAACTGCAATTTTACGGTTGAGCCGCAAACTTTCACATCTGACTTACCAGCCCACCTACGCTCCCTTTAAACCCAATAAATCC
+
IG.AGGGIGIIIIIIIIIIIIIIIAGIIIIGGIIIIGGGG.GGGIIGIGIGGIIGG..<GGGAGGGGGIIIGIGIIAGG.GG.<GGIIGGGAGGIGGGIGGGIIGIGGGIIIGIIIIIIAGGGGIGGGGGI.AGGGGG<<..<GAGG....<<<.<GGGGGIIAAGGGGAGG.77GGGGG<AGGI<<.7.77AGGG...7AAGGGGGG....7AGGGAGGGIGGAGGGIGA..AGG.A.7.777
@D00420:169:HVM3LBCXY:1:1101:6189:2220 2:N:0:GGCTAC
GTGTGCCAGCCGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGATGCTTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGGTGTCTTGAGTGCATCAGAGGTAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTACGAAGGCAGCTTACCAGACTGTAACTGACGCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIGIIGGIIIGIIIIIIIGIIIGGIIIIIIIIAGGIIIGIIIIGIGIIIIIIIIIIIIIGIIIIIIIGIIGGGGIIIIIIIIIIIIIIIIIIIIGIGGIGGIIIIGIIIIGGAAGGGGGIGGGGGGIIIGGGGGGGIGIIGGGIIIIIIIGGGIIIGGGIAGGAGGGGAGAGGGGGGGGGGGGIIAGAGGGGGGIGAGGGAGGGGIGIGIIAGIGGGGGG.

TonyWalters

Nov 22, 2017, 1:49:22 AM
to Qiime 1 Forum
Hello Umar,

From the first R1 sequence, it looks like you have an 8 base pair barcode followed by your primer:
GAGTCACTGTGCCAGCAGCCGCGGTAA
        ^ 515f primer starts here

The second R1 read, by contrast, appears to start with the 806r primer (read on top, primer below):
GGACTACAGGGGTATCTAATCCTATTTGCTCCCCA
GGACTACHVGGGTWTCTAAT 
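For anyone who wants to repeat the two comparisons above on more reads, here is a minimal Python sketch. It is not part of QIIME: the function names and the 8 base barcode default are assumptions made for illustration. The primers contain IUPAC degenerate codes (M, H, V, W), so they are expanded to regular expressions before matching.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]", "V": "[ACG]", "H": "[ACT]",
    "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))


FWD_515F = primer_regex("GTGCCAGCMGCCGCGGTAA")   # from the mapping file
REV_806R = primer_regex("GGACTACHVGGGTWTCTAAT")  # as aligned above


def classify_read(seq, bc_len=8):
    """Guess the layout of a read from its first bases.

    Returns ("forward", barcode) if the 515f primer starts right after
    bc_len bases, ("reverse", None) if the read starts with the 806r
    primer, and ("unknown", None) otherwise. Exact matches only.
    """
    if FWD_515F.match(seq, bc_len):
        return "forward", seq[:bc_len]
    if REV_806R.match(seq):
        return "reverse", None
    return "unknown", None


# First bases of the first two R1 reads posted in this thread.
print(classify_read("GAGTCACTGTGCCAGCAGCCGCGGTAATACGTAGG"))
print(classify_read("GGACTACAGGGGTATCTAATCCTATTTGCTCCCCA"))
```

The check is strict: a read with a sequencing error in the primer, or with a different number of bases in front of it, comes back as "unknown".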

So it looks like you have a single, continuous 8 base pair barcode on R1 only, rather than a barcode split across both reads, and your reads are in mixed orientation. You will therefore want to use the read reorientation option and add a ReversePrimer column to your mapping file, with the 806r primer (GGACTACHVGGGTWTCTAAT) in every data cell.
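If editing the mapping file by hand is tedious, something like the following Python sketch could add the column. It assumes the mapping file is tab-separated with Description as the last column (the usual QIIME 1 layout); the function name is made up for illustration, and the primer is a parameter so that a different sequence can be passed in.

```python
def add_reverse_primer(mapping_lines, primer="GGACTACHVGGGTWTCTAAT"):
    """Return mapping-file lines with a ReversePrimer column inserted.

    The new column goes just before the last column, on the assumption
    that the last column is Description and should stay last.
    """
    out = []
    for line in mapping_lines:
        line = line.rstrip("\n")
        if not line:
            out.append(line)
            continue
        fields = line.split("\t")
        if line.startswith("#SampleID"):
            # Header row: add the column name.
            fields.insert(len(fields) - 1, "ReversePrimer")
        elif not line.startswith("#"):
            # Sample row: add the primer sequence.
            fields.insert(len(fields) - 1, primer)
        # Other comment lines are passed through unchanged.
        out.append("\t".join(fields))
    return out


rows = [
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tBarcodeName\tProjectName\tDescription",
    "HSB.1\tGAGAGTGT\tGTGCCAGCMGCCGCGGTAA\t515Fbar1\t091317MJ515F\tHSB.1",
]
for row in add_reverse_primer(rows):
    print(row)
```

It is worth running validate_mapping_file.py on the result before using it.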

I think we have a problem, though: extract_barcodes.py does not seem to handle paired-end data when one of the barcode lengths is zero, which is what we would need for the reverse read (your R2 reads). I've opened an issue here: https://github.com/biocore/qiime/issues/2207

In the meantime, I can think of a potential workaround: cut off the first base of your ReversePrimer sequences, so they would be GACTACHVGGGTWTCTAAT. Then call extract_barcodes.py with parameters like this:
--bc1_len 0 --bc2_len 1 --attempt_read_reorientation -c barcode_paired_end

This should create oriented reads as output, along with "barcodes" that are just the first base of the reverse primer. Only the oriented R1 and R2 reads are needed from this step.

Then I would recommend stitching your R1 and R2 reads, e.g., with the join_paired_ends.py script (or with outside software such as PEAR, which gives me high yields), and running extract_barcodes.py on the stitched reads with this parameter:
--bc1_len 8

You should then have stitched reads that match the V4 amplicon, and a barcode read file that can be used as input for split_libraries_fastq.py.
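Before running split_libraries_fastq.py, it may be worth checking how many of the extracted barcodes are actually in the mapping file, since that is the number that was low in the original attempts. The Python sketch below is one way to do that. It is not a QIIME script; the function names are made up, and it assumes a plain four-line-per-record FASTQ file.

```python
from collections import Counter


def read_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = (ln.rstrip("\n") for ln in lines if ln.strip())
    # zip over the same iterator four times groups lines into records.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header[1:], seq, qual


def barcode_hit_rate(fastq_lines, known_barcodes, bc_len=8):
    """Count reads whose first bc_len bases are a known barcode.

    Returns (hits, total, counts), where counts tallies every
    observed barcode-length prefix.
    """
    counts = Counter(seq[:bc_len] for _, seq, _ in read_fastq(fastq_lines))
    total = sum(counts.values())
    hits = sum(n for bc, n in counts.items() if bc in known_barcodes)
    return hits, total, counts


# Two shortened example records; the barcode set comes from the mapping file.
example = [
    "@read1", "GAGTCACTGTGCCAGC", "+", "IIIIIIIIIIIIIIII",
    "@read2", "GGACTACAGGGGTATC", "+", "IIIIIIIIIIIIIIII",
]
hits, total, counts = barcode_hit_rate(example, {"GAGTCACT", "GAGAGTGT"})
print(hits, "of", total, "reads start with a known barcode")
print(counts.most_common(5))
```

On real data, pass an open file handle instead of the example list. The most common prefixes that are not in the mapping file are often a hint about what went wrong, for example reads that start with a primer instead of a barcode.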