unexpected behavior with header regex

37 views
Skip to first unread message

Lee Brooks

unread,
Apr 19, 2014, 5:34:21 PM4/19/14
to solexaq...@googlegroups.com
Hi,

I am using SolexaQA v2.2
I have been reading your source code because I am interested in parsing the fastq read id headers in my own application. I ran some tests on a snippet of your code. I adapted your code from SolexaQA.pl at proximity line 510, which parses the tile number from the header using regular expressions. I tested the code snippet by copying the header examples from your FAQ "Can the SolexaQA package handle different header line formats?". The unexpected behaviour that I observed was in regards to header ID "@USI-EAS45-42B86AAXX:7:1:1793:1105:0:1:0". Here is a quote from your FAQ that explains that this read format is not supported.

We are aware of three further variants that SolexaQA does not currently support:

@8:1:670:774
@USI-EAS45-42B86AAXX:7:1:1793:1105:0:1:0
@SRR797058.1 HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1

Note that these variants have no unique identifiers to distinguish between them, and considered together, there is no way to identify the correct tile number for all of these variants (either 1 or 1101 in the examples above). Consequently, the ability of the SolexaQA package to analyze all FASTQ variants ever released by Illumina is limited.


http://solexaqa.sourceforge.net/questions.htm#headers

I would like to point out to you that your code actually does produce a match to "@USI-EAS45-42B86AAXX:7:1:1793:1105:0:1:0". The only reason that this particular header results in a SolexaQA error is that it zero is assigned to $tile_number, not because your regular expression caught a bad format. This was unexpected. Do you have a format definition for this type of header that dictates that the field will always be zero?

Thanks for your time and effort developing this extremely useful package.
-Lee

use strict;
use warnings;

#http://solexaqa.sourceforge.net/questions.htm
my @read_id = ('@EAS139-23_0002:1:100:1142:10040#ATCACG/1',
'@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG',
'@AGRF-23_0002:1:100:1142:10040#0/1',
'@114222772_110210mm_lane7.txt_SN603_WA003:7:1:2036.50:19999.80#0/1',
'@114222772_110210mm_lane7:7:1:2036.50:19999.80#0/1',
'@txt_SN603_WA003:7:1:2036.50:19999.80#0/1',
'@7.t:7:1:2036.50:19999.80#0/1',
'@HWUSI-EAS1758R:18:62RU0AAXX:5:1:1030:952 1:N:0:',
'@EAS139:136:FC706VJ:2:1101:15343:197393 1:Y:18:ATCACG',
'@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36',
'@HWI-ST938:49:B01D4ACXX:5:1101:2169:194234',
'@HWI-EAS209_0025_FC427:6:1:1041:14884#ACAGTG/2',
'@M00143:1:000000000-A0AH7:1:1:16598:1320 1:N:0:1',
'@8:1:670:774',
'@USI-EAS45-42B86AAXX:7:1:1793:1105:0:1:0',
'@SRR797058.1 HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1');

foreach(@read_id){
    my $line = $_;
    # retrieve sequence information
    my $tile_number;
   
    # regular expression to find tile number from Solexa fastq file sequence IDs
    $tile_number = 0;
    if( $line =~ /\S+\s\S+/ ){
    # Cassava 1.8 variant
        if( $line =~ /^@[\d\w\-\._]+:[\d\w]+:[\d\w\-]+:[\d\w]+:(\d+)/ ){
            $tile_number = $1;
            print "Found a tile number: $tile_number\n";
    # Sequence Read Archive variant
    }elsif( $line =~ /^@[\d\w\-\._\s]+:[\d\w]+:(\d+)/ ){
        $tile_number = $1;
        print "Found a tile number: $tile_number\n";
    }
    # All other variants
    }elsif( $line =~ /^@[\d\w\-:\._]*:+\d*:(\d*):[\.\d]+:[\.\/\#\d\w]+$/ ){
        $tile_number = $1;
        print "Found a tile number: $tile_number\n";
    }
   
    if( !$tile_number ){
        print "Error: $line not in correct Illumina format\n";
    }else{
        print "The tile number for $line is: $tile_number\n"
    }
}

MPC

unread,
Apr 19, 2014, 8:30:29 PM4/19/14
to solexaq...@googlegroups.com
Hi Lee,

Thanks for getting in touch.  (Albeit about probably one of my least favorite topics - read IDs!)

In short, we've never been able to figure out a regular expression that correctly parses all the header lines that we know about.  (As far as we can tell, it's logically impossible to do so - although I would be quite happy to be proven wrong on this point).

The header line under interest is this one:

@USI-EAS45-42B86AAXX:7:1:1793:1105:0:1:0

If I understand right, you thought SolexaQA might return an error if it detected header lines of this type?  Instead, the software incorrectly assigns the tile numbers (here, thinking the tile number is '0' when it is actually '1').  Am I following your email correctly?

The fact that we can't develop a regular expression to read header lines of this type also means that we cannot reliably detect that this is an incorrect header (and thus return an error) in the first place.  Instead, we have to leave checking of input files up to the user.  (In practice, this works quite well because the header lines that SolexaQA does not support are relatively uncommon).

I would be very interested to know if there is a better way of parsing these header lines though.  Have you come up with any strategy that is more robust?

-Murray

Lee Brooks

unread,
Apr 20, 2014, 10:47:04 AM4/20/14
to solexaq...@googlegroups.com
Hi Murray,

I am fortunate that you have addressed the header problem in your code. I can't think of any better way to do this. My question is motivated by my selfish desire to translate your well-thought-out header parsing routine for use in my own application. When I translated your code, I found that one out of the three unsupported header formats listed in your FAQ is handled differently than the other two. The header line in question:


@USI-EAS45-42B86AAXX:7:1:1793:1105:0:1:0

does indeed cause SolexaQA to produce an error. However, I noticed that your regular expression actually recognizes it as a match and parses a tile number, which is zero. Perhaps by coincidence, SolexaQA produces an error because when evaluating the statement:

if( !$tile_number ){
    die "Error: Lane ID at line number $line_counter not in correct Illumina format";
}

The assigned $tile_number of 0 causes the program to die. This is because the pattern is actually matched but the tile number parsed is zero. Whereas the other two unsupported header formats are handled differently; they do not match you regular expressions at all. I guess my concern is that some user would come along with a file that contains the unsupported header type in question ( @USI-EAS45-42B86AAXX:7:1:1793:1105:0:1:0 ), except perhaps there is not a zero in the matched position -- perhaps it is a 9. This would look like this: "@USI-EAS45-42B86AAXX:7:1:1793:1105:9:1:0". I think that SolexaQA would not produce an error and would carry on thinking that 9 is the right answer.

I'm glad to hear that this unsupported variant is very rare. That is probably the answer to my question. I think that I am worrying about this rare header variant too much. I guess it would be much easier if there was a standardized header format. Thank you for all the work that you have done. The variety of non-standard header formats make my head ache.

cheers,
Lee

MPC

unread,
Apr 20, 2014, 4:06:57 PM4/20/14
to solexaq...@googlegroups.com
Hi Lee,

Yes, that logic is formally right.

In practice, I believe those characters encode either down/up or forward/reverse, and so will always be 0 or 1.  In any normal file, the 0 would come first.  It's always possible that the code would not throw an error on some file, but it wouldn't produce normal output either.  (I think it would probably create a graph with one tile, or more likely, throw an error elsewhere in the code).  Given how rare this particular header is (I only saw it for a brief while some years ago), this probably isn't a huge deal.

That said, you've got me wondering whether there's a better way to do this header parsing.  A whole series of tests isn't great, because the tests are evaluated millions of times and produce a noticeable slowdown in the program runtime.  However, we have toyed with an initial evaluation step, and then a simple capture for the rest of the file.  We might play around with that idea again.

If you have any thoughts, please let me know!

-Murray

MPC

unread,
May 12, 2014, 2:27:09 AM5/12/14
to solexaq...@googlegroups.com
Hi Lee,

Just a quick follow up.  We have now re-written our algorithm for parsing header lines, and I have just uploaded a new version of the software (v. 2.3) to the SourceForge website.  This new version can read all the header lines that we know about, as described in the revised FAQ section.  Runtime speed remains unchanged.

-Murray

Lee Brooks

unread,
May 14, 2014, 11:32:36 AM5/14/14
to solexaq...@googlegroups.com
Hi Murray,

The readability of your code is outstanding. This is hugely informative.

I'm trying to build a program that stores read IDs in a data structure that provides fast membership testing. My intent is to parse the cluster coordinates to provide a concise unique primary key.

At any rate, thanks for the update!

-Lee
Reply all
Reply to author
Forward
0 new messages