process .fastq files for qiime analysis

jian wang

Feb 12, 2016, 3:33:14 PM
to Qiime 1 Forum

Dear friends,


I am new to QIIME and am having trouble processing the .fastq files downloaded from the NCBI website so that I can perform downstream analysis in QIIME.


We have one .fastq file per sample. The first two reads from one of these .fastq files are shown below:


@SRR1023137.1 HPZ94RL02G0U1W length=532

TCAGTCATAGACACCTACCGGGTATCCGAATCCTGTTTGCTTCCCCACGCCTTTCGAGTCCTCAGCGTCAGTTACAAGCCAGAGAGCCGCTTTCGCCTACCGGTGTTCCTCCATATATCTACGCATTTCACCGTCTACACATGGAATTCCACTTCTCCCCTCTTCGCACTCAAGTTAAACAGTTTTCCAAAGTCGTACTATGGTTAAGCCACAGCCTTTAACTTCTAGACTTATCTTAACCGCCTGCGCTCGCTTTACGCCCTAATAACTCCGGACAACGGCTCGGGACCTACGGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTCCCTTTCTCGGTTAAGATACCGTCACAGTGTGAACTTTCCACTCTCACACTCGTTCTTCTCTTACAACAGAGCTTTTACGATCCGAAAACCTTACTTCACTCACGCGGCGTTGCTCGGTCAGACTTCCGTCCATTGCCGAAGACTTCCCCACTGCTCGCCCCTGAGACTGCCAAGGGCACACAGGGGGGATAGGGGNNGNNNNN

+SRR1023137.1 HPZ94RL02G0U1W length=532

FFFFFFFA@?DB<<<<<880004<?88988ABDDFFCDEFFF;:77A9A<<10000>=<8666<>>???400034>844277669=<932...34667:566668220003899<<???<<<@@?444<44444>=988<==;;66222400000<4577A9AAA?=<<<=973--,,,...,,,,--334:::6445:A?<<<<@BB5555??DB=;;AAB>8554989>>?<4.000<<<=<777<<??98000/3=511311-...---55767--..376666<<<?830004223;9:2.0044><889>?444<<====<43222666//...22;622344448@BBDDDDD?==D?>>888=DD>>?DB====>=8888888<>;64467503,,,,3430,,,72233;;22-2001276467774558::1224::11224---888600222227433,,,,,....0-0000,,,...000..00000,,,,,0011..,,,,,,1332222!!2!!!!!

@SRR1023137.2 HPZ94RL02JBWQM length=516

TCAGTCATAGACACCTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAAGCCAGAGAGCCGCTTTCGCCACCGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCCCTCTTGCACTCAAGTTAAACAGTTTCCAAAGCGTACTATGGTTAAGCCACAGCCTTTAACTTCAGACTTATCTAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTCGGGACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTCCCTTTCTGGTAAGATACCGTCACAGTGTGAACTTTCCACTCTCACACTCGTTCTTCTCTTACAACAGAGCTTTACGATCCGAAAACCTTCTTCACTCACGCGGCGTTGCTCGGTCAGACTTCCGTCCATTGCCGAAGATTCCTCACTGCTGCCTCCCCTGAGACTGCCCAAAGGGCACACAGGGGGGAGTAGGGNNNNNNN

+SRR1023137.2 HPZ94RL02JBWQM length=516

IIIIIIIIIIIIIIIIIIII666;EIIIIIIIIII:::IB:3311A7IFB@BHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIIIIIIIIIIIGGGGIIIIIIIII@@@@IIIIIIIIIIIIIIIHHHIIICCCIIHHHIIIIIIIIIIIIIIIIIGGGGIIIHHHIIIIIIIIIIIIIIIIIICCCIIIIIIIIIIIH;;;?GHHEIIIEHHGCCCCIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIGGGIIIIC?=;@CDD;711133=;DEEIIIIIIIIIIIIIIIIIIIGGGIHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIEEIII?95559=IEGDDD;>>>;;>?;EEE>;<>@AGA===;>>=>A?>:::>><>AA??:777::>AAA@AA@@>>>>==A9:8<<<778----788<=<<<335000555AA@==<006662=)54444!!!!!!!


The downloaded files contain no information about the corresponding barcodes and linker primers. From reading the manual, we understand that we need a reads.fastq, a barcode.fastq, and a mapping file (which includes the barcodes and linker primers) to obtain the .fna file using “split_libraries_fastq.py”; the downstream analyses (e.g., building an OTU table) can then be run on the .fna file.

I wonder how I can extract the barcode and primer information from the .fastq files downloaded from the NCBI website. Any information and help would be really appreciated!

 

Thank you very much.

Best, 

Jian

zech xu

Feb 18, 2016, 7:30:58 PM
to Qiime 1 Forum
Hi Jian,

Yes, to demultiplex you need to provide a fastq of the actual reads, a barcode fastq, and a mapping file. Are you saying the example sequences are all from one sample? If so, then you don't need split_libraries_fastq.py. That script is for demultiplexing, i.e., splitting sequences into their individual samples.

jian wang

Feb 19, 2016, 10:44:15 AM
to Qiime 1 Forum
Hi Zech,

Thank you for your kind response. May I ask a follow-up question?

From my understanding, I will need a .fna file to perform the downstream analysis, such as OTU picking. But I don't see how I can obtain this kind of .fna file from the .fastq file downloaded from the NCBI website. 

Appreciate your help. 
Jian

Colin Brislawn

Feb 19, 2016, 12:04:12 PM
to Qiime 1 Forum
Hello Jian,

You can use the script split_libraries_fastq.py to quality filter the reads in that fastq file and write the high-quality reads to an fna file.

If you want to keep every read in the fastq file, you can just convert it to fasta using this Linux command:
sed -n '1~4s/^@/>/p;2~4p' example.fastq > example.fasta
(Of course, you would replace the 'example' files with your input fastq and output fasta.)
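
The first~step address form (1~4, 2~4) used in that sed command is a GNU sed extension, so it may not work with the BSD sed found on macOS. A minimal sketch of the same conversion with POSIX awk is below. It assumes every record spans exactly four lines (no wrapped sequences); the fastq_to_fasta name and the demo file names are placeholders, not anything from QIIME.

```shell
#!/usr/bin/env bash
# Convert a 4-line-per-record FASTQ file to FASTA using POSIX awk.
# Assumes no wrapped sequence lines; file names are placeholders.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: swap leading @ for >
         NR % 4 == 2 { print }                      # sequence line, unchanged
        ' "$1"
}

# Small demonstration on a two-record file in a temporary directory.
demo_dir=$(mktemp -d)
printf '@r1 first\nACGT\n+\nIIII\n@r2 second\nGGCC\n+\n!!!!\n' > "$demo_dir/example.fastq"
fastq_to_fasta "$demo_dir/example.fastq" > "$demo_dir/example.fasta"
cat "$demo_dir/example.fasta"
```

As with the sed version, every read is kept and the quality lines are simply dropped, so no quality filtering happens.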

Colin Brislawn

jian wang

Feb 19, 2016, 4:06:41 PM
to Qiime 1 Forum
Hi Colin,

Thank you very much for your kind reply. 

1. I tried the split_libraries_fastq.py script, but it did not work. I am not sure whether that is because I have one .fastq file per sample and no mapping file or barcode file. The command is below, where "SRR1023137.fastq" is the fastq file for one sample.

split_libraries_fastq.py -i SRR1023137.fastq -o output --barcode_type not-barcoded --sample_ids SRR1023137

The command failed with this error message:

"skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: SRR1023137.1 HPZ94RL02G0U1W length=532. This may be because you passed an incorrect value for phred_offset."

I am not sure what this means. Is it because I don't have the barcode and mapping files?

2. Thank you for the linux command to convert the .fastq file to the .fasta file. It is working like a charm! :)
I then used the pick_open_reference_otus.py script on this .fasta file and I think it is working. This means that the .fasta file does not need the barcode in the description lines like below. Is this correct? If I downloaded the .fasta files from the NCBI website, does that mean that I can use them directly?
  • >PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0
  • CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATCGCCTTGGTGGGC
3. Because I have one .fastq per sample and I have ~100 such .fastq files, do I need to convert all the files to .fasta, merge all the .fasta files into one .fasta, and then run pick_open_reference_otus.py? Or can I run pick_open_reference_otus.py for each sample and then combine the OTU tables somehow?

Really appreciate your kind help!
Jian





Colin Brislawn

Feb 19, 2016, 6:40:40 PM
to Qiime 1 Forum
Hello Jian,

> 3. Because I have one .fastq per sample and I have ~100 such .fastq files, do I need to convert all the files to .fasta, merge all the .fasta files into one .fasta, and then run pick_open_reference_otus.py? Or can I run pick_open_reference_otus.py for each sample and then combine the OTU tables somehow?

Thank you for mentioning this. This gives me a much better idea of your data set. I know exactly what you should do.

Take a look at the multiple_split_libraries_fastq.py script. This may be the perfect fit for dealing with your hundreds of fastq files. The output of this script is a single fasta file that is set up for OTU picking.

> can I run pick_open_reference_otus.py for each sample
No, don't! OTU picking is meant to be run on your fully pooled data set. The various split libraries scripts are all designed to produce a single fasta file containing all reads from all of your samples, which you can then use for OTU picking.
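
To make the idea of a pooled file concrete, here is a minimal sketch that concatenates per-sample FASTQ files into one FASTA file whose labels follow the SampleID_N pattern shown earlier in this thread (for example PC.634_1). It only illustrates the label format and is not a substitute for the split libraries scripts, because it does no quality filtering. The pool_fastq name, the demo files, the per-sample numbering, and the choice to use the file name as the sample ID are all assumptions.

```shell
#!/usr/bin/env bash
# Pool per-sample FASTQ files into one FASTA with QIIME-style labels
# (>SampleID_N original_header). No quality filtering is done here.
# pool_fastq is a hypothetical helper; sample ID = file name without .fastq.
pool_fastq() {
    local dir=$1 f sample
    for f in "$dir"/*.fastq; do
        sample=$(basename "$f" .fastq)
        awk -v s="$sample" '
            NR % 4 == 1 { n++; print ">" s "_" n " " substr($0, 2) }
            NR % 4 == 2 { print }
        ' "$f"
    done
}

# Demonstration with two tiny per-sample files.
demo_dir=$(mktemp -d)
printf '@a1\nACGT\n+\nIIII\n@a2\nTTTT\n+\nIIII\n' > "$demo_dir/SRR1.fastq"
printf '@b1\nGGCC\n+\nIIII\n' > "$demo_dir/SRR2.fastq"
pool_fastq "$demo_dir" > "$demo_dir/seqs.fna"
cat "$demo_dir/seqs.fna"
```

Every read from every sample ends up in the one seqs.fna file, and the sample each read came from is recoverable from the part of the label before the underscore. That single file is what OTU picking expects as input.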


Let us know if you have any other questions!
Colin Brislawn
 

jian wang

Feb 19, 2016, 10:04:46 PM
to Qiime 1 Forum
Hi Colin,

Thank you very much for your prompt reply. I will try the multiple_split_libraries_fastq.py script, and will come back to discuss further.

Appreciate your help. 

Jian 


Colin Brislawn

Feb 19, 2016, 10:52:00 PM
to Qiime 1 Forum
We are here to help!
Colin

jian wang

Mar 10, 2016, 4:28:56 PM
to Qiime 1 Forum
Hi Colin,

I apologize for the delay. I have tried the multiple_split_libraries_fastq.py script but got an error about phred_offset (please see below).

skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: SRR1023137.2 HPZ94RL02JBWQM length=516. This may be because you passed an incorrect value for phred_offset.

I found some other posts about the same error and tried several of the suggested fixes, including setting phred_offset to 33 or 64 in the parameter file. Neither value worked; I kept getting the same error message. I also tried removing the sequence named in the error message, but that did not help either.
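
Before trying offsets by trial and error, it can help to look at the quality characters themselves. The sketch below reports the lowest and highest ASCII codes found on the quality lines of a FASTQ file. A lowest code below 59 can only come from Phred+33 data, since the Phred+64 and Solexa+64 encodings do not use characters that low. The two example reads at the top of this thread contain '!' (ASCII 33), which points to an offset of 33. The qual_range name and the demo file are assumptions for illustration, and the script expects exactly four lines per record.

```shell
#!/usr/bin/env bash
# Print the lowest and highest ASCII codes seen on the quality lines
# (every 4th line) of a FASTQ file. qual_range is a hypothetical helper.
qual_range() {
    awk '
        BEGIN {
            # Lookup table from printable character to ASCII code.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            min = 127; max = 0
        }
        NR % 4 == 0 {
            n = length($0)
            for (j = 1; j <= n; j++) {
                ch = substr($0, j, 1)
                if (!(ch in ord)) continue   # skip anything non-printable
                c = ord[ch]
                if (c < min) min = c
                if (c > max) max = c
            }
        }
        END { print min, max }
    ' "$1"
}

# Demonstration: the quality string FF!I has codes 70, 70, 33, 73.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nFF!I\n' > "$demo_dir/q.fastq"
qual_range "$demo_dir/q.fastq"
```

If the lowest code is 33 or close to it, 33 is the offset to pass on, for example through the --phred_offset option of split_libraries_fastq.py.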

My command is:

multiple_split_libraries_fastq.py -i fastq_dump -o fastq_out -p ~/parameterfile.txt

I have also attached the log file. Could you please help with this issue?

Thank you very much!
Jian



log_20160310152444.txt

Colin Brislawn

Mar 10, 2016, 4:53:08 PM
to Qiime 1 Forum
Hello Jian,

Thanks for posting the full command you ran. The command looks fine, so I wonder if there is some issue with the parameters file.

Can you post the first few lines of parameterfile.txt? Maybe the issue is in there.

Colin

jian wang

Mar 10, 2016, 5:14:07 PM
to Qiime 1 Forum
Hi Colin,

Thank you very much for your quick reply. I have now attached my parameter file.

Please let me know if you have other questions. 

Thanks,
Jian
parameterfile.txt

Colin Brislawn

Mar 10, 2016, 6:08:01 PM
to Qiime 1 Forum
Ah ha!

One of your lines says:
multiple_split_libraries_fastq:phred_offset 33
Try replacing that with:
split_libraries_fastq:phred_offset 33
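
The likely reason this matters is that multiple_split_libraries_fastq.py is a wrapper that runs split_libraries_fastq.py on each input, and a QIIME parameters file addresses each option by the name of the script that finally consumes it. Lines that carry the wrapper's name appear to be ignored rather than rejected, which matches the unchanged error seen in this thread. A minimal sketch for writing the file and catching the mistaken prefix is below; the check_params helper and the temporary file are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Write a QIIME 1 parameters file and flag lines that use the wrapper
# script name as the prefix. check_params is a hypothetical helper.
check_params() {
    if grep -q '^multiple_split_libraries_fastq:' "$1"; then
        echo "wrong prefix: use split_libraries_fastq:"
        return 1
    fi
    echo "ok"
}

params=$(mktemp)
printf 'split_libraries_fastq:phred_offset 33\n' > "$params"
check_params "$params"

# The file would then be passed with -p, as in the command posted
# earlier in this thread:
# multiple_split_libraries_fastq.py -i fastq_dump -o fastq_out -p "$params"
```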

Let me know if that works!
Colin


PS Trying many parameters to see what works is great. I'm glad you are doing that.

 

jian wang

Mar 11, 2016, 4:32:31 PM
to Qiime 1 Forum
Hi Colin,

I just changed the line in the parameter file to "split_libraries_fastq:phred_offset 33".

It works like magic! :)

Thank you very much for your help!

Jian

Colin Brislawn

Mar 11, 2016, 5:00:45 PM
to Qiime 1 Forum
I'm glad I could help.
That's why we are here.
Colin