Creating FASTQ files for SRA submission


Becky

Apr 8, 2016, 7:17:51 PM
to Qiime 1 Forum
Hi,
I am trying to create FASTQ files for NCBI SRA submission. I ran into the issue of the number of bases and the number of quality scores not matching up (e.g., 250 bases and 260 quality scores). I realized that split_libraries.py was filtering the FASTA, so it was no longer lining up with my qual file. I then found that split_libraries.py has a -d parameter, which creates a qual file. The issue is that when I try to use that qual file to create a FASTQ, I get the error below. Any idea what is going on?

qiime@qiime-VirtualBox:~$ make_fastq.py -f /home/qiime/split_library_IJM7PW102_C/seqs.fna -q /home/qiime/split_library_IJM7PW102_C/seqs_filtered.qual -s -o IJM7PW102_FASTQ_C
Traceback (most recent call last):
  File "/home/qiime/qiime_software/qiime-1.7.0-release/bin/make_fastq.py", line 76, in <module>
    main()
  File "/home/qiime/qiime_software/qiime-1.7.0-release/bin/make_fastq.py", line 70, in main
    make_fastq_multi(in_fasta, quals, opts.result_fp)
  File "/home/qiime/qiime_software/qiime-1.8.0-release/lib/qiime/make_fastq.py", line 83, in make_fastq_multi
    for rec, label in iter_fastq(in_fasta, quals, label_transform):
  File "/home/qiime/qiime_software/qiime-1.8.0-release/lib/qiime/make_fastq.py", line 64, in iter_fastq
    qual = quals[qual_id]
KeyError: 'IJM7PW102BQSDC'
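
The traceback ends in quals[qual_id], so the key being looked up (here apparently the original read ID, IJM7PW102BQSDC) is absent from the qual file that was passed in. A minimal sketch for checking this outside QIIME, assuming plain-text FASTA and QUAL files whose headers are whitespace-separated; the helper names header_ids and missing_ids are hypothetical and not part of QIIME, and the records are illustrative rather than taken from the real files:

```python
def header_ids(lines, field=0):
    """Collect one whitespace-separated field from every '>' header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if len(parts) > field:
                ids.add(parts[field])
    return ids


def missing_ids(fasta_lines, qual_lines, fasta_field=0):
    """Return IDs used by the FASTA that have no record in the QUAL file."""
    return header_ids(fasta_lines, fasta_field) - header_ids(qual_lines, 0)


# Illustrative records only (assumed layout, not the real data).
fasta = [">sampleA_1 READ001 orig_bc=ACGT", "ACGTACGT",
         ">sampleA_2 READ002 orig_bc=ACGT", "GGCCTTAA"]
qual = [">sampleA_1 READ001", "40 40 40 40 40 40 40 40",
        ">sampleA_2 READ002", "40 40 40 40 40 40 40 40"]

# Matching on the demultiplexed label (field 0) finds every record.
print(missing_ids(fasta, qual, fasta_field=0))
# Matching on the original read ID (field 1) finds none of them.
print(sorted(missing_ids(fasta, qual, fasta_field=1)))
```

If the second call reports missing IDs for the real files, the qual file is keyed by a different label than the one the script looks up.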


Becky

Apr 8, 2016, 10:37:11 PM

So I saw another post which suggested using convert_fastaqual_fastq.py and then split_libraries_fastq.py. When running split_libraries_fastq.py, I think the error I am getting may be related to barcode_read_fps. Should this info be available in the mapping file?

I keep getting an error:

qiime@qiime-VirtualBox:~$ split_libraries_fastq.py -i /home/qiime/IC9LIJC01_FASTQ_CB/UKY_run1.fastq -m 454_UKY_run1B_no_redos.txt -b UKY_run1_barcodes --store_demultiplexed_fastq --barcode_type 10 --retain_unassigned_reads -o IC9LIJC01_split_fastq

Traceback (most recent call last):
  File "/home/qiime/qiime_software/qiime-1.7.0-release/bin/split_libraries_fastq.py", line 354, in <module>
    main()
  File "/home/qiime/qiime_software/qiime-1.7.0-release/bin/split_libraries_fastq.py", line 333, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/home/qiime/qiime_software/qiime-1.8.0-release/lib/qiime/split_libraries_fastq.py", line 293, in process_fastq_single_end_read_file
    ("Headers of barcode and read do not match. Can't continue. "
qiime.split_libraries_fastq.FastqParseError: Headers of barcode and read do not match. Can't continue. Confirm that the barcode fastq and read fastq that you are passing match one another.
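
The error says the barcode FASTQ and the read FASTQ disagree on their headers. A minimal sketch for locating the first disagreement, assuming both files use strict 4-line FASTQ records and that comparing the first header token is enough (QIIME's own comparison may differ); fastq_ids and first_header_mismatch are hypothetical helper names and the records are made up:

```python
from itertools import zip_longest


def fastq_ids(lines):
    """Return the ID (first token after '@') of each 4-line FASTQ record."""
    return [line[1:].split()[0] for line in lines[0::4]]


def first_header_mismatch(read_lines, barcode_lines):
    """Return (index, read_id, barcode_id) of the first disagreement, else None."""
    pairs = zip_longest(fastq_ids(read_lines), fastq_ids(barcode_lines))
    for index, (read_id, barcode_id) in enumerate(pairs):
        if read_id != barcode_id:
            return index, read_id, barcode_id
    return None


# Illustrative records only: the second barcode record belongs to another read.
reads = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGG", "+", "IIII"]
barcodes = ["@r1", "AC", "+", "II", "@r3", "TG", "+", "II"]

print(first_header_mismatch(reads, barcodes))
```

Both arguments are lists of lines, e.g. from handle.read().splitlines(); a file with one record fewer than the other shows up as a None on the shorter side.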

jonsan

Apr 11, 2016, 11:52:55 AM
Hi Becky,

How exactly are these libraries constructed and sequenced? Are the barcodes read by the sequencer in a separate read? Currently, it looks like split_libraries_fastq.py is looking for a 10 bp barcode sequence in the file UKY_run1_barcodes for each sequence in the file UKY_run1.fastq. Most of the 454 libraries I've worked with have the barcode inline as part of the sequence read. I need to check to be sure, but I think split_libraries_fastq.py will only work if provided with a separate barcode read.

I think your first approach is probably the right one. Once you've generated the qual file and the demultiplexed fasta seqs.fna using split_libraries.py with the -d option, pass both of those to the convert_fastaqual_fastq.py script along with the --multiple_output_files option. That will produce a single fastq file for each sample.
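
For orientation, combining a FASTA record with its QUAL record amounts to encoding each integer quality score as one character and refusing records whose lengths differ. A minimal sketch, assuming Phred+33 encoding; fastq_record is a hypothetical helper, not the QIIME implementation:

```python
def fastq_record(label, seq, quals, offset=33):
    """Build one 4-line FASTQ record from a sequence and its integer Phred scores."""
    if len(seq) != len(quals):
        # This is the mismatch described at the top of the thread.
        raise ValueError(
            "%s: %d bases but %d quality scores" % (label, len(seq), len(quals)))
    quality = "".join(chr(q + offset) for q in quals)
    return "@%s\n%s\n+\n%s\n" % (label, seq, quality)


# Illustrative record: four bases with scores 40, 40, 39 and 34.
record = fastq_record("sampleA_1", "ACGT", [40, 40, 39, 34])
print(record)
```

The length check is the reason the filtered qual file matters: the scores have to describe exactly the bases that survived split_libraries.py.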

Hope that helps,
-jon

Becky

Apr 13, 2016, 8:52:26 PM
Jonsan, 
Thanks for the direction, using the convert_fastaqual_fastq.py script worked! I have a question about the output of the script. I see a (sample name).fastq file and a seqs_(sample name).fastq file. What is the difference between the two? I have multiple 454 runs, and 2 out of 4 runs only produced seqs_(file name).fastq files. I am assuming that the (sample name).fastq file is the one to use as the fastq file. Any idea why I am only getting one format and not both?
Becky

jonsan

Apr 14, 2016, 1:57:27 PM
Hi Becky,

Sounds like we're getting closer!

Can you paste the first few lines from an example seqs_(file name).fastq and from a (sample name).fastq file, along with the exact command you ran to get this output?

Thanks,
-jon

Becky

Apr 14, 2016, 8:53:49 PM
Jon,
Let me know if there is anything else that could help.

Script:
qiime@qiime-VirtualBox:~$ convert_fastaqual_fastq.py -f /home/qiime/split_library_IJM7PW101_C/seqs.fna -q /home/qiime/split_library_IJM7PW101_C/seqs_filtered.qual -m --multiple_output_files -o IJM7PW101_FASTQ

Here are the first few lines from seqs.29.V.12.1.2H.fastq:
@29.V.12.1.2H_500003
CCTGTTCGCTACCCACGCTTTCGCTCCTGAGCGTCAGTACCAGTCCAGGTAGCCGCCTTCGCCACTGGTGTTCCTCCCAATATCTACGCATTTCACCGCTACACTGGGAATTCCGCTACCCTCTCCTGGCCTCTAGCGAGACAGTTCCAAAGGCAGTTCCTCAGTTGAGCTGAGGGATTTCACCTTTGGCTTATCAAGCCGCCTACACGCCCTTTACGCCCAATGATTCCGAATAACGCTTGCCCCTCGCA
+
IIIIIIIIIIIHHHIIIIGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIIIIIIIIIIHHHIIIIIIIIIIIIHHHIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIHHHHIICCCCCIIIIIIIIIIIIIIIIIIIIIIHHHIHHHIIIIHHHIIIIIDDDHEEEEE<<??BCCICCCED??9?EGGHHIIIHHHHI???BIIIEGEA24427/=8:

Here are the first few lines from 29.V.12.1.2H.fastq:
@29.V.12.1.2H_4 read_id=IJM7PW101AKL16 barcode=TACGAGTATG
CCTGTTCGCTACCCACGCTTTCGCTCCTGAGCGTCAGTACCAGTCCAGGTAGCCGCCTTCGCCACTGGTGTTCCTCCCAATATCTACGCATTTCACCGCTACACTGGGAATTCCGCTACCCTCTCCTGGCCTCTAGCGAGACAGTTCCAAAGGCAGTTCCTCAGTTGAGCTGAGGGATTTCACCTTTGGCTTATCAAGCCGCCTACACGCCCTTTACGCCCAATGATTCCGAATAACGCTTGCCCCTCGCATTACCGCGGCTGCTGGCACCTGATGGCGCGA
+29.V.12.1.2H_4 read_id=IJM7PW101AKL16 barcode=TACGAGTATG
IIIIIIIIIGDCIIII@A222233=IFGGIIIIIIIIIIIIHHHIIIIGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIIIIIIIIIIHHHIIIIIIIIIIIIHHHIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIHHHHIICCCCCIIIIIIIIIIIIIIIIIIIIIIHHHIHHHIIIIHHHIIIIIDDDHEEEEE<<??BCCICCCED??9?EGGHHIIIHHHHI???BIIIEGEA24427/=8:9==@BB;=?EEIIIIIIIHCDCEGEEGEE=9

jonsan

Apr 15, 2016, 1:58:27 PM
Strange. And these are both coming from the same seqs.fna input file? 

When you say you have multiple 454 runs and some do this while others don't, are all of those runs represented in the same seqs.fna file above? 

My guess is there's something strange going on with the sequence headers in the input fasta file. In particular, the fact that the bottom example has the additional information after the sequence name (like the read ID and barcode), but the top doesn't, makes me wonder if there is something different about those going in. How did you get from original data to the seqs.fna file? And to go along with the previous example, can you attach the output file from the following command? 

grep '29.V.12.1.2H' seqs.fna > sample_29_headers.txt

If I'm right, then the lines in that file that start with '>29.V.12.1.2H_500003' and '>29.V.12.1.2H_4' should be different somehow...
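
For reference, the grep command above reads seqs.fna and the redirect (>) writes every matching line into a new file, sample_29_headers.txt, which does not need to exist beforehand. The same check can be sketched in Python; sample_headers is a hypothetical helper, and the second header in the example is made up:

```python
def sample_headers(fasta_lines, sample_id):
    """Return header lines whose label starts with '<sample_id>_'."""
    prefix = ">" + sample_id + "_"
    return [line.rstrip("\n") for line in fasta_lines if line.startswith(prefix)]


# Illustrative input: one header for sample 29 and one for an invented sample.
lines = [
    ">29.V.12.1.2H_500003 IJM7PW101AKL16 orig_bc=TACGAGTATG new_bc=TACGAGTATG bc_diffs=0\n",
    "CCTGTTCGCTACC\n",
    ">otherSample_500004 READ002 orig_bc=ACGTACGTAC new_bc=ACGTACGTAC bc_diffs=0\n",
    "GGCCTTAAGGCC\n",
]
headers = sample_headers(lines, "29.V.12.1.2H")
print(headers)

# On the real file (path assumed to be the split_libraries.py output):
# with open("seqs.fna") as handle:
#     headers = sample_headers(handle, "29.V.12.1.2H")
```

Unlike the grep pattern, where each dot matches any character, this compares the sample ID literally.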

-jon

Becky

Apr 15, 2016, 6:41:38 PM
Jon,
I have a total of 4 runs. IC9LIJC01 and IC9LIJCO2 were run separately but consecutively. Then IJM7PW101 and IJM7PW102 were run a couple of months later to resequence samples with low read counts, and they included a couple of other samples. IJM7PW101 and IJM7PW102 contain both (sample name).fastq and seqs_(sample name).fastq file types, while IC9LIJC01 and IC9LIJCO2 only contain the seqs_(sample name).fastq file type.


Here is the workflow from the IJM7PW101 run:
qiime@qiime-VirtualBox:~$ split_libraries.py -m UKY_redo_map.txt -f /home/qiime/454_UKY_run_2/IJM7PW101.fasta -q /home/qiime/454_UKY_run_2/IJM7PW101.qual -b 10 -n 500000 -o split_library_IJM7PW101/ --reverse_primer_mismatches 1 -z truncate_only -d

I used the original fasta and qual files I received from the sequencing center. Then...

qiime@qiime-VirtualBox:~$ convert_fastaqual_fastq.py -f /home/qiime/split_library_IJM7PW101_C/seqs.fna -q /home/qiime/split_library_IJM7PW101_C/seqs_filtered.qual -m --multiple_output_files -o IJM7PW101_FASTQ
Using the fasta and qual file created from split libraries.


[Attachment: seqs.fna]

Becky

Apr 15, 2016, 7:01:26 PM
Jon,
I tried running the command grep '29.V.12.1.2H' seqs.fna > sample_29_headers.txt, but I couldn't get it to work. I am assuming seqs.fna means the corresponding fasta file from split libraries? I am unsure which text file you want me to use for sample_29_headers.txt.
Becky

jonsan

Apr 16, 2016, 4:45:48 PM
Hi Becky, 


It's looking to me like the seqs.fna file you attached only has sequences from that sample 29 with a format like '29.V.12.1.2H_500003'; in other words, starting with 500000 per your split_libraries.py -n 500000 option. Based on the snippets you pasted in from the '29.xxxxx' vs 'seqs.29.xxxxx' files above, I'm thinking that 'seqs.29.V.12.1.2H.fastq' is derived from this seqs.fna file while the sequences in '29.V.12.1.2H.fastq' must be coming from somewhere else. 

Also intriguingly, the first sequence header you pasted in for '29.V.12.1.2H.fastq': (@29.V.12.1.2H_4 read_id=IJM7PW101AKL16 barcode=TACGAGTATG), has a read_id that matches the read ID of the following sequence in seqs.fna: >29.V.12.1.2H_500003 IJM7PW101AKL16 orig_bc=TACGAGTATG new_bc=TACGAGTATG bc_diffs=0. *That* matches the sequence header ('@29.V.12.1.2H_500003') for the first read in 'seqs.29.V.12.1.2H.fastq', and the sequences mostly match -- except that in the fastq file and the seqs.fna file you attached, the sequence is truncated, while in the file '29.V.12.1.2H.fastq' it has additional bases at the end.
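
A quick way to confirm that one read is just a truncated copy of another is a prefix test. A minimal sketch with shortened, illustrative strings; extra_suffix is a hypothetical helper, and a strict prefix test will report None if the two reads also differ somewhere inside:

```python
def extra_suffix(short_seq, long_seq):
    """Return the bases long_seq has beyond short_seq, or None if not a prefix."""
    if long_seq.startswith(short_seq):
        return long_seq[len(short_seq):]
    return None


# Shortened stand-ins for the truncated and the longer read.
truncated = "CCTGTTCGCTACC"
full = "CCTGTTCGCTACCCACG"

print(extra_suffix(truncated, full))
print(extra_suffix("GGGG", full))
```

The same test applied to the two quality strings shows whether the scores were truncated at the same end as the bases.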

Putting all this together, it looks to me like the file '29.V.12.1.2H.fastq' is derived from a different run of split_libraries.py on the sequences from batch IJM7PW101. 

Does that make any sense?

-jon

bernd

Apr 18, 2016, 3:01:42 AM
Dear Becky,

I noticed your question and this thread a bit late, but perhaps the following offers a different way of preparing data for the SRA. Since you used split_libraries.py, I suppose you started with 454 data. The NCBI supports SFF files, so you would not even have to convert to FASTQ; see http://www.ncbi.nlm.nih.gov/books/NBK47529/

Given that the FASTA and QUAL lines do not match, split_libraries.py changed the sequence data. I'd propose submitting the "raw" SFF data that has not passed through QIIME quality filtering etc. You can demultiplex this SFF data using the SFF tools (sfffile). How to split an SFF file into one SFF file per sample is explained here: https://microbeatic.wordpress.com/2011/11/15/demultiplexing-sff-files-based-on-barcode/


Kind regards,
Bernd

Becky

Apr 18, 2016, 4:21:46 PM
Bernd,
The only issue is that I don't have access to the 454 software. Can sfffile be downloaded in VirtualBox? All the sequencing for my project was done out of house a while ago, so the sequencing center most likely doesn't have the data anymore.
Becky

TonyWalters

Apr 18, 2016, 5:18:22 PM
Becky, if you can get access to the sfffile/sffinfo software (from Roche; it wasn't distributed freely, but that may have changed), you can download it and run it on the VirtualBox. I haven't used it myself, but there is also this, which you could try running in Windows with your .sff file: http://www.dnabaser.com/download/nextgen-fastq-editor/index.html

Becky

Apr 18, 2016, 9:47:54 PM
Thanks so much Tony! We will see if NCBI takes the sff files I created... fingers crossed!