LengthSort error: strange header line


MPC

Oct 3, 2011, 3:40:46 PM
to solexaqa-users
Dear Dr. Cox,
After I dynamically trimmed my paired-end Illumina FASTQ files with
DynamicTrim.pl, LengthSort doesn't seem to recognize the two trimmed
files as being paired. The error message is:

error: files s_4_1_sequence_TGACCA_0MM.txt.trimmed and
s_4_2_sequence_TGACCA_0MM.txt.trimmed do not seem to be paired

In fact they are paired files. Could the fact that the identifier in
the latter file is ".../3" instead of ".../2" have something to do
with it?

In case this might help, here are the first lines for the very first
entry in each of the two respective paired/trimmed files:

@NIKITA_0798:4:1:16662:2376#TGACCA/1 run=110525_SN403_0798_A818L9ABXX
CCTCAACCATGATGTCGGCCGTGCTCTTGCTCTCCTCCGCGGAGGCCTTTCAGGCCCCTGTGCGTGCAGAGGCGCCATCCCTCAGCCGCA
+NIKITA_0798:4:1:16662:2376#TGACCA/1
ggggggggeggggdggbdggfegegdegggfeeg_gegg[def_fgcag`feecccVccc^ccc[^^aVISFWSHXSV[UddWaN___Q^


@NIKITA_0798:4:1:16662:2376#TGACCA/3 run=110525_SN403_0798_A818L9ABXX
CTTAACCAGTGGCCACAATCATGAACCACCCGAACATCGCGTGGCGG
+NIKITA_0798:4:1:16662:2376#TGACCA/3
[S]^[`ca`^ceeebbc```cb`dbddefVdZbdTJU_^QUVP\[a^

I am quite new to handling RNA-Seq data, so any help or suggestions
you can give me would be appreciated!
Thanks in advance,
John

MPC

Oct 3, 2011, 3:41:21 PM
to solexaqa-users
Hi John,

Thanks for getting in touch.

Yes, as you guessed, you've encountered a problem with the file
format. I'm not sure where this dataset came from, but it's certainly
not in standard Illumina format, which would look something like this:

@NIKITA_0798:4:1:16662:2376#TGACCA/1
@NIKITA_0798:4:1:16662:2376#TGACCA/2

The /3 is particularly strange -- I've never seen that before.
However, funnily enough, that's not actually what's causing the
problem. Instead, LengthSort doesn't like the extra run information
appended to the header:

@NIKITA_0798:4:1:16662:2376#TGACCA/1 run=110525_SN403_0798_A818L9ABXX
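
To see which part of a header is the read name, which is the mate
number, and which is extra text, a small diagnostic along these lines
may help. It is only a sketch: the regular expression and the function
name split_header are illustrative assumptions, and this is not how
LengthSort itself parses headers.

```python
import re

# Hypothetical pattern for headers of the form "@<read id>/<mate> [extra text]".
# This is a diagnostic sketch, not LengthSort's own parsing logic.
HEADER_RE = re.compile(r"^@(?P<read_id>\S+?)/(?P<mate>\d+)(?:\s+(?P<extra>.*))?$")


def split_header(line):
    """Return (read_id, mate, extra) for a FASTQ header line.

    extra is None when nothing follows the mate number.
    Returns None when the line does not look like such a header.
    """
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group("read_id"), match.group("mate"), match.group("extra")


header = "@NIKITA_0798:4:1:16662:2376#TGACCA/1 run=110525_SN403_0798_A818L9ABXX"
print(split_header(header))
```

A header whose third element is not None carries extra text after the
mate number, which is the kind of header discussed here.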

I've noticed that some sequence providers are starting to include
extra information like this. I imagine they think they're being
helpful, but it actually breaks a lot of downstream software. Also,
it's a bit redundant. This extra information is repeated for every
read -- this particular addition only takes up 33 bytes (including the
leading space), but once you repeat that for a few hundred million
reads, it starts adding non-trivial amounts to your file sizes.
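
As a rough illustration of that arithmetic, the sketch below multiplies
the size of the extra field by a few read counts. The read counts are
hypothetical round numbers, not figures from this dataset, and the
estimate assumes uncompressed files in which only the "@" line of each
record carries the field (as in the example entries in this thread).

```python
# The extra text from the example header, including its leading space.
EXTRA_FIELD = " run=110525_SN403_0798_A818L9ABXX"


def overhead_bytes(n_reads, extra=EXTRA_FIELD):
    """Bytes added to an uncompressed FASTQ file when `extra` is
    appended once per read."""
    return n_reads * len(extra.encode("ascii"))


# Hypothetical read counts, chosen only to show the scale.
for n_reads in (1_000_000, 100_000_000, 300_000_000):
    gigabytes = overhead_bytes(n_reads) / 1e9
    print(f"{n_reads:>11,d} reads: {gigabytes:.2f} GB of extra header text")
```

At 300 million reads this comes to about 9.9 GB per file before
compression.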

The simplest approach is just to remove this extra information; for
instance, using the following sed command:

sed 's/ run=110525_SN403_0798_A818L9ABXX//g' for.fastq > for_mod.fastq

This command doesn't overwrite the original file, but instead writes
the corrected headers to a new file. Both of your paired files carry
the extra text, so run it once for each. That said, always work off
copies of your files and preferably test out command lines like these
on small subsets of the data. This command works for me, but it may
behave differently on your installation!
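
If the run string differs between files, a more general alternative is
to drop everything after the first whitespace on the header lines. The
Python sketch below does that and can be limited to the first few
records for testing. The function names and the max_records parameter
are made up for illustration, and the code assumes strict four-line
FASTQ records with no blank lines between them.

```python
from itertools import islice


def strip_header_extras(lines):
    """Yield FASTQ lines with anything after the first whitespace removed
    from the '@' and '+' lines (lines 1 and 3 of each 4-line record).

    Assumes strict 4-line records with no blank lines in between.
    Header lines are always written back with a trailing newline.
    """
    for index, line in enumerate(lines):
        if index % 4 in (0, 2):
            fields = line.split(None, 1)
            yield (fields[0] if fields else "") + "\n"
        else:
            # Sequence and quality lines are passed through untouched.
            yield line


def clean_fastq(src_path, dst_path, max_records=None):
    """Write a cleaned copy of src_path to dst_path without touching the
    original. If max_records is given, only that many records are copied,
    which is handy for a trial run on a small subset."""
    limit = None if max_records is None else 4 * max_records
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(strip_header_extras(islice(src, limit)))


# Example trial run on the first 1000 records (file names are placeholders):
# clean_fastq("for.fastq", "for_subset_mod.fastq", max_records=1000)
```

Working by line position rather than by looking for a leading "@"
avoids mistaking a quality line that happens to start with "@" for a
header.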

Once those non-standard headers are changed, LengthSort works ok for
me on your data.

Best
-Murray