Illumina 1.8+ format


Matthew MacManes

Jul 31, 2013, 9:54:08 PM
to solexaq...@googlegroups.com
Any chance we can get support for the new Illumina 1.8+ format described here: http://en.wikipedia.org/wiki/FASTQ_format ?

Best, Matt


MPC

Jul 31, 2013, 11:12:30 PM
to solexaq...@googlegroups.com
Hi Matt,

Assuming we understand it correctly, the SolexaQA package has supported the Illumina 1.8+ format since at least version 2.0:


Have you found some error that we need to be aware of?

Best
-Murray

Matthew MacManes

Aug 1, 2013, 12:50:23 PM
to solexaq...@googlegroups.com
Murray, 

I am working with a file from the SRA: http://www.ebi.ac.uk/ena/data/view/SRR797058

I get the error:
SolexaQA.pl -p 0.01 test.fastq 
Automatic format detection: Sanger FASTQ format
Error: File test.fastq does not match Solexa ID format (note: this may be caused by an incomplete final entry/empty terminal lines)


The file is Illumina 1.8+, NOT Sanger: it lacks the quality characters ! and ". It is not Solexa, Illumina 1.3+ or Illumina 1.5+ either, as it lacks K, L, M...

The file otherwise seems fine. Here are the head and tail:


head test.fastq 
@SRR797058.1 HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1
NAAACACCACTTTTTGCACAGCCTGGCCCTGTTAGGGGTACCCTCTTGCAGAAAACCTGTCTGGGCAGGATTACTGTTAGCTTCTGGAACTACCTTATTCT
+
#1=DDDFFGHHHGJJIJIJIIIIIHGCGHIBDHGHGJIFAGHIJJGIGIIJJJJJIJIHJJG77=CHFFDEEEEEDDDCDCCCCC>ACCDCCCCDDCCDED
@SRR797058.2 HWI-ST600:227:C0WR4ACXX:7:1101:16650:2000/1
NCTCGGTGGCAAACGGACAGTGCCATAGGAAGAGACCATTTGTAGATAGTCAAATGGGGAGACAGAGGACGGTTTGAACTCGTGTTCTTCTTCCAGAACCG
+
#4=DDFDFGHHHHIJJJJJJIGIHHIIJJIIIJJGIJJJJJIIIHGIJIHIJJJIJJJIGGFHHHEFFFDDB;?@BD>CCCB?BA?CEDDDDCCD@CAAA9
@SRR797058.3 HWI-ST600:227:C0WR4ACXX:7:1101:17167:2000/1
NCTACCAAAAAAATGCCCGATAATTCTGACCATTCCTTCCTCATTCTCGTCTGGCGTTTGGTCACGACGCACGATACCTTCTGCACTTGTCAAGACAGCGG

tail test.fastq 
+
@CCFFFF?DHDHHIJJJGIIIIIJJIIJJJIIJIJJIJJIIJJJIJJIIJJJJIIIJIFGHHHEHFBDFFFDEEDDDBDCBDCDDDDDDDDDDDDDDCCCC
@SRR797058.5568040 HWI-ST600:227:C0WR4ACXX:7:1103:5613:69796/1
CCCCTTCTCCTGCTCCATTGAATTGGCACTTGATGAGCAGAAGTCCAGTGTGGTGCTGATCTGGGTCAGTCATTCACAAGAGACCACTGCACTTTGATGTG
+
CCCFFFFFHHHHHJIJJJJJIIJJJIIIIJJJJJJJIJIJGHHFHIGIGHBFHFGGHIEIIIIIIFDEHDHIIIJICH@HEFFFFEDCCEEDDDDCDDDCC
@SRR797058.5568041 HWI-ST600:227:C0WR4ACXX:7:1103:5635:69797/1
CACAATCCAGCTGCTCGTGCGCCAGAGTCTGGCACTGCCTCTCAGTACAAAAGAACGGGACTCGGAGCGCTCAGACTCTGACTCTGGCTATTGTGTGGGTC
+
???DDDD8DDD?D@+A<EC71<?8E):1?B*:::?BDDD3DDD*99=@)8B=)5=@ADDDDDDD@898',5=?AA5:::>AA(:>A###############

MPC

Aug 1, 2013, 5:03:12 PM
to solexaq...@googlegroups.com
Hi Matt,

Interestingly, this error doesn't actually indicate a problem with the quality encoding.  SolexaQA correctly recognizes the file as a Sanger variant and applies the correct quality scores.

(There is some debate about what comprises a 'new' FASTQ encoding.  Some people prefer to describe every little change as a new variant, while others tend to recognize just a few major classes of very closely related variants.  From a coding perspective, the Sanger and Illumina 1.8+ formats can be parsed with exactly the same algorithm, so I guess this puts me with the lumpers.  Why Illumina even insists on fiddling in minor ways with the standard FASTQ formats, I'll never know...)
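
To make the "same algorithm" point concrete, here is a minimal Python sketch of decoding a quality line. It is not SolexaQA's own code (SolexaQA is a Perl script); the function names are made up for illustration. It relies only on the fact that Sanger and Illumina 1.8+ both store Phred scores as ASCII characters with an offset of 33, so one decoder serves both. The two variants differ only in which scores actually appear in the file.

```python
# Illustrative sketch only; names are hypothetical, not part of SolexaQA.
PHRED33_OFFSET = 33  # ASCII offset shared by Sanger and Illumina 1.8+


def decode_phred33(quality_line):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - PHRED33_OFFSET for ch in quality_line]


def error_probability(score):
    """Phred definition: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# Start of the first quality line quoted earlier in this thread.
scores = decode_phred33("#1=DDDFF")
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

Under this definition a probability cutoff of 0.01, as in the `-p 0.01` run above, corresponds to a Phred score of 20.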

In any case, the error you are seeing is something different.  It reflects another annoying habit - fiddling with the read header lines.

Unlike many other QC programs, SolexaQA explicitly determines quality per tile.  The price of this is quite high - the program has to determine the tile number for every read.  Unfortunately, Illumina and others continually change the header format (although the company deserves considerable credit for mostly sticking with standard formats over the last couple of years):


There are now so many header formats, all very different, that it is logically impossible to parse the tile number from all of them.  The philosophy we have taken is for SolexaQA to support all of the major variants (especially those coming off the latest generation of Illumina machines).

In your example, the Sequence Read Archive (SRA) has modified the read headers in their files:

@SRR797058.1 HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1

This is where your error comes from - SolexaQA can no longer determine the tile number.  (This particular error is actually more often caused by an extra empty line at the end of the file, hence the wording of the error message.)
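
As a rough illustration of why the extra SRA prefix breaks tile detection, here is a small Python sketch. It is an assumption about the general approach, not SolexaQA's actual parsing code. It assumes a header whose first whitespace-delimited token has seven colon-separated fields (instrument, run, flowcell, lane, tile, x, y, with an optional /1 or /2 suffix on the last field), which matches the header shown above. The function name `tile_from_header` is hypothetical.

```python
# Illustrative sketch only; not SolexaQA's real header parser.
def tile_from_header(header):
    """Return the tile number from a header shaped like
    @instrument:run:flowcell:lane:tile:x:y/read, or None if it does not fit.
    """
    if not header.startswith("@"):
        return None
    tokens = header[1:].split()
    if not tokens:
        return None
    # Only the first token is inspected, as a simple parser might do.
    fields = tokens[0].split(":")
    if len(fields) != 7:
        return None
    tile = fields[4]
    return int(tile) if tile.isdigit() else None


sra_header = "@SRR797058.1 HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1"
plain_header = "@HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1"
print(tile_from_header(sra_header))    # first token is the SRA accession
print(tile_from_header(plain_header))  # first token is the Illumina ID
```

With the SRA accession in front, the first token no longer has the expected layout, so a parser of this kind finds no tile. With the plain Illumina header, the fifth field (1101) is the tile.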

The good news is that the solution is simple: strip out the extra characters added by the SRA (using sed, awk or your preferred alternative), so that the header reads:

@HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1

This reverts the file to a standard Illumina header format and SolexaQA runs the file just fine.
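
For anyone who prefers a script to sed or awk, the header cleanup could look something like the Python sketch below. The regular expression and function names are assumptions made for illustration: it treats the SRA prefix as an accession such as SRR797058 followed by a dot, a read number and whitespace, and it assumes the file has exactly four lines per record. Headers are picked by line position rather than by a leading @, because quality lines can also start with @ (one does in the tail output earlier in this thread).

```python
import re

# Hypothetical pattern for the prefix the SRA adds, e.g. "@SRR797058.1 ".
SRA_PREFIX = re.compile(r"^@[SED]RR\d+\.\d+[ \t]+")


def strip_sra_prefix(header):
    """Remove the SRA accession prefix from one header line, if present."""
    return SRA_PREFIX.sub("@", header, count=1)


def clean_fastq(lines):
    """Yield FASTQ lines with the SRA prefix removed from header lines.

    Assumes four lines per record, so every fourth line (starting with
    the first) is a header; all other lines pass through unchanged.
    """
    for index, line in enumerate(lines):
        yield strip_sra_prefix(line) if index % 4 == 0 else line


record = [
    "@SRR797058.1 HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1\n",
    "NAAACACC\n",
    "+\n",
    "#1=DDDFF\n",
]
print("".join(clean_fastq(record)), end="")
```

To clean a whole file, open the input and output files yourself, pass the input file object to `clean_fastq`, and write each yielded line to the output. Any file names you choose are your own; nothing here is specific to SolexaQA.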

Best
-Murray
