Determining .fastq file with barcodes

Newbie

Jun 25, 2012, 9:16:54 AM
to Qiime Forum
Hi,

I recently did a run on a MiSeq to sequence the 16S region from a
mixed community sample. It was not a very good run by any means (low
cluster density, ~50% PF), but I was hoping to use the data in a
run-through of QIIME. Although I set up my sample sheet with my 12 bp
barcodes, my reads were mostly sorted into an Undetermined folder
(except for 1 sample). I am trying to work out whether this is because
the only thing that sequenced was my PhiX control (meaning I have to go
back and fix my original primers :-/ ) or whether the control
software could not sort my sequences because of my sequencing primer
constructs (maybe it was looking in the wrong place for the
barcode?).

I have used QIIME in the past with 454 data, but I was trying to
follow the instructions for processing Illumina data. I used the "head
-n 4" command to see which of my fastq files contained my barcodes,
but I cannot tell from the output.

Below is what I see in my terminal window:

qiime@linux:~/Desktop$ head -n 4 '/home/qiime/Desktop/Shared_Folder/06222012_Rhizobox_16S/DNA-3_S25_L001_R1_001.fastq'
@M00364:15:000000000-A0PHJ:1:1101:16219:2841 1:N:0:25
GACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGTTTATTAAGTCTACTGTTAAAGATCAGGGCTTAACCCTGAGCAGGCAATAGAAACAAATAACCTAGAGAACGGTAGGGGCAGAGGGAATTCCCGGT
+
?????BB?DDDDBBDBFGGCFFHHHCCFHBGHHIIFFHHHHHFFHHHIIHHIGGHHIHAAFFDGE?EEECGHHHHF-55EGHGFEFHHHFH+?D-7@77656?77777774??6?D>?>DDBBBE.+534;CC8@:*4***0*00?E*0.'
qiime@linux:~/Desktop$ head -n 4 '/home/qiime/Desktop/Shared_Folder/06222012_Rhizobox_16S/DNA-3_S25_L001_R2_001.fastq'
@M00364:15:000000000-A0PHJ:1:1101:16219:2841 2:N:0:25
CNNATTTGNTCCCCTAGCTTTCGTCTCTCAGTGTCAGTTTCGGCCCAGCAGAGTGCTTTTGCCGTTTTTTNNTTTTCCGATTTTTACTCATTTTACCACTAAACCGGGAATTTTTTTTTTCCCTACTTTATTTTATTTTTTTATTTTTTTT
+
5##55<=5#555555,-888C8666668-8/9/999.9AF.7+++57+,7,7+5--55E----*55-+55##555++-5+5555=<+++36=D+6++++63+3++40)30<D**22190'(((//((//.(/(/.///;(.;((/6?6;'.
qiime@linux:~/Desktop$
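
For reference, the header lines in the output above can be pulled apart field by field. The sketch below assumes the headers follow Illumina's Casava 1.8-style layout; the function name and dictionary keys are illustrative labels, not part of any QIIME or Illumina tool. On MiSeq output the final field often holds the sample number from the sample sheet rather than the barcode sequence, which would be consistent with the 25 here matching the S25 in the file name, but that reading is an assumption worth confirming with the sequencing center.

```python
def parse_illumina_header(header):
    """Split a Casava 1.8-style FASTQ header line into named fields.

    Assumed layout (not verified against this particular run):
    @instrument:run:flowcell:lane:tile:x:y read:is_filtered:control:index_or_sample
    """
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    location, _, description = header[1:].partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, last = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),           # 1 for R1, 2 for R2
        "is_filtered": is_filtered,  # "Y" or "N"
        "control": int(control),
        "index_or_sample": last,     # barcode sequence or sample number
    }


# Header copied from the R1 output shown above
fields = parse_illumina_header(
    "@M00364:15:000000000-A0PHJ:1:1101:16219:2841 1:N:0:25"
)
print(fields["read"], fields["index_or_sample"])
```

If the last field is a number rather than a run of bases, the barcode is not recorded in the header and has to be looked for elsewhere (a separate index read file, or the start of R1 or R2).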

This was a PE 2x150 run, and my index/barcode was 12 bp. I am not sure
if I am also supposed to be looking for the length of my Read 1 and
Read 2 primers. I am very new to this type of data, and I was
hoping someone on the QIIME forum could help me.

Thank you in advance!

Tony Walters

Jun 25, 2012, 12:44:38 PM
to qiime...@googlegroups.com
Hello,

Generally, our protocol for Illumina data is to generate three reads (a forward read, a reverse read, and a barcode read). So the only files you received were the R1 and R2 files? It may be that the barcodes are at the beginning of one of the reads. You could check this by choosing a couple of barcodes from your data and trying this:
grep -c "^>xxxxxxxxxxxx" /home/qiime/Desktop/Shared_Folder/06222012_Rhizobox_16S/DNA-3_S25_L001_R1_001.fastq
where xxxxxxxxxxxx is your barcode sequence.

This will tell you if a large number of sequences are starting with the barcode.

It may be helpful to check with the sequencing center to determine how they generated the reads; that may help sort out the location of the barcodes. The three-read method I mentioned is depicted in Figure 1 of this paper: http://www.pnas.org/content/108/suppl.1/4516.full.pdf+html?with-ds=yes

Hope this helps,
Tony Walters

Tony Walters

Jun 25, 2012, 12:54:33 PM
to qiime...@googlegroups.com
Sorry, a slight modification to the command I gave above (I'm used to dealing with FASTA sequences):
grep -c "^xxxxxxxxxxxx" /home/qiime/Desktop/Shared_Folder/06222012_Rhizobox_16S/DNA-3_S25_L001_R1_001.fastq
where xxxxxxxxxxxx is your barcode sequence.
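
One caveat with grep on a FASTQ file is that it tests every line, including quality lines, which can also begin with the letters A, C, G or T. A minimal Python sketch that checks only the sequence line of each record might look like the following. It assumes an uncompressed FASTQ file with exactly four lines per record; the function name and the demo data are made up for illustration.

```python
import os
import tempfile
from itertools import islice


def count_reads_starting_with(fastq_path, barcode):
    """Return (matching reads, total reads) for a 4-line-per-record FASTQ file."""
    total = matches = 0
    with open(fastq_path) as handle:
        # The sequence is the 2nd line of each 4-line record: indices 1, 5, 9, ...
        for seq in islice(handle, 1, None, 4):
            total += 1
            if seq.startswith(barcode):
                matches += 1
    return matches, total


# Tiny made-up FASTQ file, for demonstration only. The quality line of read2
# happens to start with ACGTACGT, so an anchored grep would count it as well.
demo_records = (
    "@read1\nACGTACGTTTTT\n+\nIIIIIIIIIIII\n"
    "@read2\nTTTTACGTACGT\n+\nACGTACGTIIII\n"
    "@read3\nACGTACGTGGGG\n+\nIIIIIIIIIIII\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo_records)

matches, total = count_reads_starting_with(tmp.name, "ACGTACGT")
os.remove(tmp.name)
print(f"{matches} of {total} reads start with the barcode")
```

Reporting the total alongside the match count makes it easier to judge whether "a large number" of reads really start with the barcode.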

Bharath

Dec 6, 2012, 1:28:31 PM
to qiime...@googlegroups.com
Hi Tony,

Just following up on this post about sorting MiSeq barcode sequences: I have a sequence set with the same pattern, i.e., barcodes at the beginning of the reads in the FASTQ files. Also, I have only the R1 and R2 files, without a barcode reads file.

Is there a way to generate the barcode reads file after locating the barcode sequences in the FASTQ sequences?

Any advice in this regard would be immensely helpful.

Thanks,
Bharath

Tony Walters

Dec 6, 2012, 3:29:45 PM
to qiime...@googlegroups.com
Hello Bharath,

Are you using the grep -c command above with some of your barcode sequences and seeing that there are a large number of them showing up at the beginning of one of your read files?

-Tony

Bharath

Dec 6, 2012, 4:23:05 PM
to qiime...@googlegroups.com
Hi Tony,

Thanks. Yes, I used the command and observed a large number of them showing up.

I have pasted the output with the counts below:

hernandezlab$ macqiime grep -c "TGACTG" HernandezV1V2_S1_L001_R1_001.fastq 
732092

hernandezlab$ macqiime grep -c "GTGACA" HernandezV1V2_S1_L001_R1_001.fastq 
460703

hernandezlab$ macqiime grep -c "TGACTG" HernandezV1V2_S1_L001_R2_001.fastq 
106040

hernandezlab$ macqiime grep -c "GTGACA" HernandezV1V2_S1_L001_R2_001.fastq 
127319

Bharath

Bharath

Dec 6, 2012, 5:26:41 PM
to qiime...@googlegroups.com
Hi Tony,

Also, there were more barcode matches in the R2 FASTQ file:

hernandezlab$ macqiime grep -c "TCTGT" HernandezV1V2_S1_L001_R2_001.fastq
1141857
hernandezlab$ macqiime grep -c "ACGAC" HernandezV1V2_S1_L001_R2_001.fastq 
1332534
hernandezlab$ macqiime grep -c "CGATG" HernandezV1V2_S1_L001_R2_001.fastq 
774894
hernandezlab$ macqiime grep -c "ATGTA" HernandezV1V2_S1_L001_R2_001.fastq
592695
hernandezlab$ macqiime grep -c "TGCGA" HernandezV1V2_S1_L001_R2_001.fastq 
871484
hernandezlab$ macqiime grep -c "GACGA" HernandezV1V2_S1_L001_R2_001.fastq 
1799729
hernandezlab$ macqiime grep -c "GTAATC" HernandezV1V2_S1_L001_R2_001.fastq
246951
hernandezlab$ macqiime grep -c "GTATC" HernandezV1V2_S1_L001_R2_001.fastq 
1007552
hernandezlab$ macqiime grep -c "GTCTA" HernandezV1V2_S1_L001_R2_001.fastq 
1193736
hernandezlab$ macqiime grep -c "CAGAG" HernandezV1V2_S1_L001_R2_001.fastq
2152555
hernandezlab$ macqiime grep -c "CGCTC" HernandezV1V2_S1_L001_R2_001.fastq 
1165142

Tony Walters

Dec 6, 2012, 5:41:04 PM
to qiime...@googlegroups.com
Hello Bharath,

You might want to include a ^ at the beginning of those search strings so that grep only checks the beginning of each line, since some of the counts may simply be chance matches from the middle of the sequences. For instance:
macqiime grep -c "^TCTGT" HernandezV1V2_S1_L001_R2_001.fastq
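
As a rough back-of-the-envelope check on how often a short string turns up by chance, the sketch below estimates the probability that a fixed 5-mer occurs somewhere in a 150-base read, compared with occurring exactly at the start. It assumes every base is equally likely and independent, which real 16S amplicons certainly are not, so treat the numbers as order-of-magnitude only. The function name is illustrative.

```python
def chance_of_kmer_in_read(k, read_length):
    """Approximate probability that a fixed k-mer occurs at least once
    anywhere in a random read of the given length."""
    positions = read_length - k + 1   # number of possible start positions
    per_position = 0.25 ** k          # chance of a match at one position
    # Treats positions as independent, which is only approximately true
    return 1.0 - (1.0 - per_position) ** positions


anywhere = chance_of_kmer_in_read(5, 150)
at_start = 0.25 ** 5
print(f"anywhere in the read: {anywhere:.3f}")
print(f"at the start of the read: {at_start:.5f}")
```

Under these assumptions an unanchored 5-base pattern is expected to hit a sizeable fraction of all reads by chance alone, while an anchored one hits far fewer, so only the anchored counts say much about where the barcodes sit.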

In any case, although it's not supported by QIIME, here is a simple parser that pulls the first N bases off each read in a FASTQ file, writing those N bases to one file (the barcode reads file) and the remaining sequence to a second file (the reads file). As it uses a PyCogent library, you'll need to be in the macqiime environment to run it.

Usage:
python parse_bc_reads.py X Y Z A
where X is the file path of the FASTQ file to parse,
Y is the output file path for the barcode reads,
Z is the output file path for the rest of each sequence after the barcode has been trimmed off, and
A is the length of your barcodes.
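
The attached script itself is not reproduced in this thread. As a rough idea of what such a parser involves, here is a minimal sketch in plain Python (no PyCogent), assuming four-line FASTQ records. The function name `split_barcodes` and its arguments are illustrative and are not taken from parse_bc_reads.py.

```python
import io


def split_barcodes(fastq_in, barcodes_out, reads_out, barcode_length):
    """Write the first `barcode_length` bases (and their qualities) of each
    record to `barcodes_out`, and the remainder to `reads_out`.

    The first three arguments are open text file handles.
    """
    n = barcode_length
    while True:
        header = fastq_in.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = fastq_in.readline().rstrip("\n")
        plus = fastq_in.readline().rstrip("\n")
        qual = fastq_in.readline().rstrip("\n")
        if (not header.startswith("@") or not plus.startswith("+")
                or len(seq) != len(qual)):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        # Both outputs keep the same header so the records can be paired later
        barcodes_out.write(f"{header}\n{seq[:n]}\n+\n{qual[:n]}\n")
        reads_out.write(f"{header}\n{seq[n:]}\n+\n{qual[n:]}\n")


# Demonstration on a single made-up record with a 4-base "barcode"
demo = io.StringIO("@read1\nACGTACGTTTTT\n+\nIIIIHHHHGGGG\n")
bc, rest = io.StringIO(), io.StringIO()
split_barcodes(demo, bc, rest, 4)
print(bc.getvalue(), rest.getvalue(), sep="")
```

With real files, the three handles would come from `open(...)` calls on the input FASTQ path and the two output paths.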

-Tony

Attachment: parse_bc_reads.py

Bharath

Dec 7, 2012, 4:42:02 AM
to qiime...@googlegroups.com
Hi Tony

Thanks a lot. That was precisely the question I had in mind: how to generate the barcode file and the reads file separately. I will try this out and update you.

Appreciate your time and reply.

Bharath

Bharath

Dec 7, 2012, 2:05:53 PM
to qiime...@googlegroups.com
Hi Tony,

The counts are certainly very different now:

hernandezlab$ macqiime grep -c "^TCTGT" HernandezV1V2_S1_L001_R2_001.fastq
3547

hernandezlab$ macqiime grep -c "^ACGAC" HernandezV1V2_S1_L001_R2_001.fastq
3462

hernandezlab$ macqiime grep -c "^CGATG" HernandezV1V2_S1_L001_R2_001.fastq
1941

I will try the script that you shared and update.

Thanks
Bharath

Bharath

Dec 10, 2012, 3:53:31 PM
to qiime...@googlegroups.com
Hi Tony,

Does this need the 1.5.0-Dev version of MacQIIME to run? I was not able to run my data through the parse_bc_reads.py script.

I have MacQIIME 1.4.0 installed.

thanks,

Bharath

Tony Walters

Dec 10, 2012, 3:59:39 PM
to qiime...@googlegroups.com
Hello Bharath,

You shouldn't need to upgrade QIIME, but I am using a newer version of PyCogent than you probably are. I'd recommend updating PyCogent to run this. (You don't actually need QIIME for this script, so you could install PyCogent independently of MacQIIME.)

-Tony
