Hi John,
Thanks for getting in touch.
Yes, as you guessed, you've encountered a problem with the file
format. I'm not sure where this dataset came from, but it's certainly
not in standard Illumina format, which would look something like this:
@NIKITA_0798:4:1:16662:2376#TGACCA/1
@NIKITA_0798:4:1:16662:2376#TGACCA/2
The /3 is particularly strange -- I've never seen that before.
However, funnily enough, that's not actually what's causing the
problem. Instead, LengthSort doesn't like the extra run information
appended to each header:
@NIKITA_0798:4:1:16662:2376#TGACCA/1 run=110525_SN403_0798_A818L9ABXX
I've noticed that some sequence providers are starting to include
extra information like this. I imagine they think they're being
helpful, but it actually breaks a lot of downstream software. Also,
it's a bit redundant. The same run information is repeated for every
read -- this particular field only takes up 33 bytes (including the
leading space), but once you repeat that for a few hundred million
reads, it starts adding non-trivial amounts to your file sizes.
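To put a rough number on that, here is a back-of-the-envelope calculation. The read count of 300 million is just an assumed figure for illustration, so substitute your own:

```shell
# Assumed read count, for illustration only.
reads=300000000
# Length of " run=110525_SN403_0798_A818L9ABXX", including the leading space.
extra_bytes=33
# Total overhead in megabytes (1 MB taken as 10^6 bytes).
echo "$(( reads * extra_bytes / 1000000 )) MB"
```

With those numbers that comes to 9900 MB, i.e. nearly 10 GB of repeated text per file.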
The simplest approach is just to remove this extra information; for
instance, using the following sed command:
sed 's/ run=110525_SN403_0798_A818L9ABXX//g' for.fastq > for_mod.fastq
This command doesn't overwrite the original file, but instead writes the
corrected headers to a new file. That said, always work off
copies of your files and preferably test out command lines like these
on small subsets of the data. This command works for me, but it may
behave differently on your installation!
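As a rough sketch of that kind of trial run, something like the following should do. The file names (toy.fastq, subset.fastq, subset_mod.fastq) and the contents of the toy file are placeholders I've made up. The pattern also assumes the extra field always starts with " run=", contains no spaces and sits at the end of the line, so it strips the field whatever the run name is:

```shell
# Toy two-read file standing in for the real data (made-up contents).
printf '%s\n' \
  '@NIKITA_0798:4:1:16662:2376#TGACCA/1 run=110525_SN403_0798_A818L9ABXX' \
  'ACGTACGT' '+' 'IIIIIIII' \
  '@NIKITA_0798:4:1:16662:2377#TGACCA/1 run=110525_SN403_0798_A818L9ABXX' \
  'TTGACCAA' '+' 'IIIIIIII' > toy.fastq

# Take a small subset: here just the first read (4 lines per read).
head -n 4 toy.fastq > subset.fastq

# Strip any trailing " run=..." field, whatever its value.
sed 's/ run=[^ ]*$//' subset.fastq > subset_mod.fastq

# Eyeball the result: the header should now end at /1.
head -n 1 subset_mod.fastq
```

On your real data you would replace toy.fastq with a copy of for.fastq and take a larger subset (say head -n 4000 for the first 1000 reads) before running the command on the whole file.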
Once those non-standard headers are changed, LengthSort works OK for
me on your data.
Best
-Murray