Reading fasta files (rather than fastq): "Number of lines in FastQ file is not multiple of 4! EOF found"

Vadim Puller

Mar 10, 2021, 3:53:08 AM
Dear NGLess team,

We have been using NGLess to analyze simulated data. The data come from different simulators, some producing fastq and others fasta output (more precisely, .fq.gz and .fa.gz files), and we assumed by default that the NGLess functions `fastq` and `paired` are capable of recognizing the file format. Only with the most recent batch of fasta files did we run into the error message "Number of lines in FastQ file is not multiple of 4! EOF found".

I have seen a related discussion on your GitHub page, and the issue is easily remedied by adding extra lines to a file or by converting it to fastq format (adding fake quality scores). However, we would like to be on the safe side and ask for a few clarifications:

1. Are the functions `fastq` and `paired` capable of recognizing fasta format? (The error message seems to give a clear answer, but the previous batches of fasta data were processed without any error messages, despite lacking some elements of the fastq format, such as the + lines and the quality scores.)
2. If these functions treat fasta as if it were fastq, do they still process every record, or only every second one (since fastq has 4 lines per record, while fasta has only 2)? The results obtained with our previous fasta files seem sensible, but we would appreciate a definitive statement from you.
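
For reference, the conversion workaround mentioned above could look roughly like the following Python sketch. It is not part of NGLess: the function names `read_fasta` and `fasta_to_fastq` are made up for illustration, and the placeholder quality character `I` (Phred 40 in the Phred+33 encoding) is an assumption. The fake qualities carry no real information, so any quality-based filtering downstream would be meaningless.

```python
import gzip


def read_fasta(handle):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(fasta_path, fastq_path, quality_char="I"):
    """Write every FASTA record as a 4-line FASTQ record with placeholder qualities.

    Both paths are assumed to be gzip-compressed (.fa.gz in, .fq.gz out).
    Returns the number of records written.
    """
    n_records = 0
    with gzip.open(fasta_path, "rt") as src, gzip.open(fastq_path, "wt") as dst:
        for header, seq in read_fasta(src):
            dst.write(f"@{header}\n{seq}\n+\n{quality_char * len(seq)}\n")
            n_records += 1
    return n_records
```

Comparing the returned record count with the number of reads the simulator reports is a cheap check that no record was skipped.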


Luis Pedro Coelho

Mar 11, 2021, 11:10:35 PM
Dear Vadim,

In principle, yes, the functions should only read fastq files. I would expect that most of the uses would quickly trigger an error downstream. Frankly, I am more surprised that it worked than anything else. What exact downstream steps were you taking?
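
One way to catch fasta inputs before they reach `fastq` or `paired` is to look at the first character of each file rather than relying on the extension. The following is a minimal Python sketch, independent of NGLess; the helper name `sniff_format` is hypothetical, and it only inspects the first non-empty line, so it does not validate the rest of the file.

```python
import gzip


def sniff_format(path):
    """Return 'fasta', 'fastq', or 'unknown' based on the first non-empty line."""
    # Assumption: compressed inputs are recognizable by a .gz suffix.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "unknown"
```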


