convert fastq to 454 gives negative quality values

ipso Facto

unread,

Nov 24, 2013, 11:49:10 PM11/24/13

to biop...@googlegroups.com

I tried using the read_fastq | write_454 pipeline.

$ read_fastq -i jj.fastq -e base_33 | write_454 -o jj.fna -q jj.qual -x
It gives me negative score values in the qual file.

$ read_454 -i jj.fna -q jj.qual
The above command can translate it back to fastq perfectly fine.

But when I try to run a 454 assembly, I get an error.
Error: Non-digit character '-' appears in quality score file: jj.qual

This is the first time I am trying to use Illumina reads as a supplement to 454 data.
Should I also be looking into changing the headers so that they are more like 454 sequence headers.

Below are first few offending lines from the files.
Any help will be greatly appreciated.

.fastq
@HS3_366:6:1101:10000:116055/1
CCGATCAAACAACAACCGTACCGTGTCTCGGGTGCGGAAGGGGAAGTCATGGAAGCTGAGGTTGACCAGTACTTGGGGATGGGCTTCATTCGGCCTTCCC
+
@BCFFFDFHHHGFIIGDGHIIIJCEFHGGGHI?FAHH;;FHIJ==;7@>3?D>@;;((-;=5;9(>@:<3:5(;@@(&(&+8A89?934(4:(&)0@:?C
@HS3_366:6:1101:10000:11853/1
TAGCTTCATCGAACGGCAGGGGGGGGGCTTCCCAAAAACCCCCCCCCCCCCCCCCCCCCTCACTACCCACCCCCCCCTCCCAAAAAAGAAAAACACCCAC
+
@@@DDDDDHHHHHIICABDGE###############################################################################

.fasta
>HS3_366:6:1101:10000:116055/1
CCGATCAAACAACAACCGTACCGTGTCTCGGGTGCGGAAGGGGAAGTCATGGAAGCTGAGGTTGACCAGTACTTGGGGATGGGCTTCATTCGGCCTTCCC
>HS3_366:6:1101:10000:11853/1
TAGCTTCATCGAACGGCAGGGGGGGGGCTTCCCAAAAACCCCCCCCCCCCCCCCCCCCCTCACTACCCACCCCCCCCTCCCAAAAAAGAAAAACACCCAC

.qual
>HS3_366:6:1101:10000:116055/1
0 2 3 6 6 6 4 6 8 8 8 7 6 9 9 7 4 7 8 9 9 9 9 3 5 6 8 7 7 7 8 9 -1 6 1 8 8 -5 -5 6 8 9 9 -3 -3 -5 -9 0 -2 -13 -1 4 -2 0 -5 -5 -24 -24 -19 -5 -3 -11 -5 -7 -24 -2 0 -6 -4 -13 -6 -11 -24 -5 0 0 -24 -26 -24 -26 -21 -8 1 -8 -7 -1 -7 -13 -12 -24 -12 -6 -24 -26 -23 -16 0 -6 -1 3
>HS3_366:6:1101:10000:11853/1
0 0 0 4 4 4 4 4 8 8 8 8 8 9 9 3 1 2 4 7 5 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29 -29

Martin Asser Hansen

unread,

Nov 25, 2013, 4:20:01 AM11/25/13

to biop...@googlegroups.com

Are you sure your data is encoded with base 33? If you omit the -e base_33 then read_fastq will try to automagically determine the encoding and will raise an error if that fails. You can also try the analyze_scores Biopieces to get an idea of the scores:

read_fastq -i jj.fastq -n 100 | analyze_scores | analyze_vals | write_tab -cpx

Or plot_scores:

read_fastq -i jj.fastq | plot_scores -t x11 -x

Cheers,

Martin

--
You received this message because you are subscribed to the Google Groups "biopieces" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biopieces+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

ipso Facto

unread,

Nov 25, 2013, 3:15:13 PM11/25/13

to biop...@googlegroups.com, ma...@maasha.dk

Hi Martin,

Yes, I am sure data is base33.

I forgot to mention, auto-detect on read_fastq is not working. It throws an error.
$ read_fastq -i jj.fastq | plot_scores -t x11 -x
/biopieces/bp_bin/read_fastq:115:in `block (4 levels) in <main>': Could not auto-detect quality score encoding (SeqError)
    from /biopieces/code_ruby/lib/maasha/filesys.rb:102:in `each'
    from /biopieces/bp_bin/read_fastq:108:in `block (3 levels) in <main>'
    from /biopieces/code_ruby/lib/maasha/filesys.rb:74:in `open'
    from /biopieces/bp_bin/read_fastq:107:in `block (2 levels) in <main>'
    from /biopieces/bp_bin/read_fastq:106:in `each'
    from /biopieces/bp_bin/read_fastq:106:in `block in <main>'
    from /biopieces/code_ruby/lib/maasha/biopieces.rb:89:in `open'
    from /biopieces/bp_bin/read_fastq:51:in `<main>'
/biopieces/bp_bin/plot_scores:74:in `[]': empty range (IndexError)
    from /biopieces/bp_bin/plot_scores:74:in `<main>'

And if I use base_64, it gives an error as well.
$ read_fastq -i jj.fastq -e base_64 | plot_scores -t x11 -x
/biopieces/bp_bin/read_fastq:123:in `block (4 levels) in <main>': Quality score outside valid range (SeqError)
    from /biopieces/code_ruby/lib/maasha/filesys.rb:102:in `each'
    from /biopieces/bp_bin/read_fastq:108:in `block (3 levels) in <main>'
    from /biopieces/code_ruby/lib/maasha/filesys.rb:74:in `open'
    from /biopieces/bp_bin/read_fastq:107:in `block (2 levels) in <main>'
    from /biopieces/bp_bin/read_fastq:106:in `each'
    from /biopieces/bp_bin/read_fastq:106:in `block in <main>'
    from /biopieces/code_ruby/lib/maasha/biopieces.rb:89:in `open'
    from /biopieces/bp_bin/read_fastq:51:in `<main>'

I also ran,
$ read_fastq -i jj.fastq -e base_33 | plot_scores -t x11 -x

Attached is the gnuplot output from both commands.

base33.png

base64.png

Martin Asser Hansen

unread,

Nov 25, 2013, 3:33:14 PM11/25/13

to biop...@googlegroups.com

OK, you found a bug - a while back I changed the internal encoding from base 64 to base 33, but I missed write_454 since it was not tested properly :o(

Thanks for the heads up!

Run bp_update and try again.

Cheers,

Martin

ipso Facto

unread,

Nov 25, 2013, 4:28:41 PM11/25/13

to biop...@googlegroups.com, ma...@maasha.dk

Thanks Martin.
The quality conversion seems to work when I specify the encoding.

$ read_fastq -i jj.fastq -e base_33 | write_454 -o jj.fna -q jj.qual -x

You say write_454 was not tested properly. Should I be using it?
Also, do you know if I will have to change the Illumina sequence headers to 454 headers, or is it built-in?

The read_fastq still does not autoDetect though. Same error as above.
$ read_fastq -i jj.fastq | write_454 -o jj.fna -q jj.qual -x

/biopieces/bp_bin/read_fastq:115:in `block (4 levels) in <main>': Could not auto-detect quality score encoding (SeqError)
    from /biopieces/code_ruby/lib/maasha/filesys.rb:102:in `each'
    from /biopieces/bp_bin/read_fastq:108:in `block (3 levels) in <main>'
    from /biopieces/code_ruby/lib/maasha/filesys.rb:74:in `open'
    from /biopieces/bp_bin/read_fastq:107:in `block (2 levels) in <main>'
    from /biopieces/bp_bin/read_fastq:106:in `each'
    from /biopieces/bp_bin/read_fastq:106:in `block in <main>'
    from /biopieces/code_ruby/lib/maasha/biopieces.rb:89:in `open'
    from /biopieces/bp_bin/read_fastq:51:in `<main>'

Martin Asser Hansen

unread,

Nov 26, 2013, 2:40:22 AM11/26/13

to biop...@googlegroups.com

read_454 and write_454 are now perfectly ok to use. I have added an issue about including these Biopieces in the integration test suite - which will happen before other upgrades.

The failing encoding detection by read_fastq is quite normal - it just tells us that with the data in the first entry the encoding cannot be unambiguously determined and that it must be specified with the -e switch.

Conversion of headers is not done by read/write_fastq/454 - they use the names provided. However, I have not encountered anywhere where header specific information from 454 reads were required in assembly? As for pair-end/mate-pair Illumina reads it is a different story.

What sort of data do you have? single-end/pair-end/mate pairs?

What are you trying to assemble? prokaryote/eukaryote?

Genome size?

What assembler are you using?

Cheers,

Martin

ipso Facto

unread,

Nov 27, 2013, 7:42:46 PM11/27/13

to biop...@googlegroups.com, ma...@maasha.dk

Thanks Martin,

I have paired end data from both Illumina and 454 for a eukaryote. Expected genome size ~100Mb.
I was trying to do a hybrid assembly using Newbler. Like you said, I did not encounter any issues with sequence headers.
However, I ran out of memory. I had 72Gb of RAM at my disposal, but that was not enough.

Do you have suggestions for doing hybrid assembly using 454 and Illumina data?

Martin Asser Hansen

unread,

Nov 28, 2013, 9:35:28 AM11/28/13

to biop...@googlegroups.com

Assembly is a huge and complicated topic. First you need to set a goal for your assembly. I spend 10% of my time getting 90% of the assembly in order - and then I spend 90% of my time on the last 10%. I find that the last 90% of my time is mostly wasted.

Check out what other people are doing on biostars.org and seqanswers.com. There are several strategies, but everything is a compromise. I recommend assembling the 454 reads with Newbler and Illumina reads with Velvet (using high kmer size perheps 60-80) or Spades. Then combine the assemblies with Zorro or GAA. It is possible to do hybrid assemblies with Ray or MIRA as well. MIRA is the most clever assembler, but it is slow.

Martin

Reply all

Reply to author

Forward