invalid fastq files?


Andrew Jaffe

Jul 31, 2013, 9:09:32 AM
to rna-...@googlegroups.com
This is my first time using this software, and I'm getting a strange error about invalid FASTQ files. TopHat also deemed them valid, but since they are longer reads, TopHat eventually crashed...

> STAR --runMode genomeGenerate --genomeDir genomeDir --genomeFastaFiles Drosophila_melanogaster/UCSC/dm3/Sequence/WholeGenomeFasta/genome.fa --sjdbOverhang 100 --sjdbGTFfile Drosophila_melanogaster.BDGP5.72.gtf
Jul 30 15:53:04 ..... Started STAR run
Jul 30 15:53:04 ... Starting to generate Genome files
Jul 30 15:53:13 ... starting to sort  Suffix Array. This may take a long time...
Jul 30 15:53:15 ... sorting Suffix Array chunks and saving them to disk...
Jul 30 15:58:59 ... loading chunks from disk, packing SA...
Jul 30 15:59:04 ... writing Suffix Array to disk ...
Jul 30 15:59:26 ... Finished generating suffix array
Jul 30 15:59:26 ... starting to generate Suffix Array index...
Jul 30 16:00:50 ... writing SAindex to disk
Jul 30 16:01:17 ..... Finished successfully

> STAR --genomeDir genomeDir --readFilesIn sample1.fastq --sjdbGTFfile Drosophila_melanogaster.BDGP5.72.gtf --sjdbScore 2 outFilterMismatchNmax 20
Jul 31 09:06:18 ..... Started STAR run
Jul 31 09:06:22 ..... Started mapping

EXITING because of FATAL ERROR in input reads: unknown file format: the read ID should start with @ or >

Jul 31 09:06:22 ...... FATAL ERROR, exiting

Same error with the second sample. Both appear to be "valid" fastq files according to FastQValidator

[ajaffe@compute-0-5 bin]$ ./fastQValidator --file sample1.fastq
Finished processing sample1.fastq with 683804 lines containing 170951 sequences.
There were a total of 0 errors.
Returning: 0 : FASTQ_SUCCESS
[ajaffe@compute-0-5 bin]$ ./fastQValidator --file sample2.fastq
Finished processing sample2.fastq with 697548 lines containing 174387 sequences.
There were a total of 0 errors.
Returning: 0 : FASTQ_SUCCESS

Alexander Dobin

Jul 31, 2013, 6:48:40 PM
to rna-...@googlegroups.com
Hi Andrew,

Judging by the counts that fastQValidator gives, it seems that you have multi-line FASTQ files, i.e. the sequence or quality strings can be split across several lines.
STAR cannot process multi-line .fastq files since, according to Wikipedia:
"The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string)."
If you can convert these files to .fasta (those could be multi-line if you wish), or single-line .fastq, STAR will be able to deal with them.
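
For reference, a minimal Python sketch of the multi-line to single-line FASTQ conversion mentioned above. It is not part of STAR; the names `unwrap_fastq` and `to_single_line` are made up for illustration. It assumes that each record's quality string has exactly as many characters as its sequence, which is what lets it tell a wrapped quality line starting with "@" from the next read header.

```python
from typing import Iterable, Iterator, Tuple


def unwrap_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a possibly line-wrapped FASTQ."""
    it = (ln.rstrip("\r\n") for ln in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header[:30]!r}")
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for ln in it:
            if ln.startswith("+"):
                break
            seq_parts.append(ln)
        else:
            raise ValueError(f"no '+' line found for {header!r}")
        seq = "".join(seq_parts)
        # Quality lines run until we have as many characters as the sequence.
        qual = ""
        while len(qual) < len(seq):
            ln = next(it, None)
            if ln is None:
                raise ValueError(f"file ended inside quality of {header!r}")
            qual += ln
        if len(qual) != len(seq):
            raise ValueError(f"quality longer than sequence for {header!r}")
        yield header, seq, qual


def to_single_line(lines: Iterable[str]) -> Iterator[str]:
    """Re-emit every record as exactly four lines."""
    for header, seq, qual in unwrap_fastq(lines):
        yield f"{header}\n{seq}\n+\n{qual}\n"


# Usage with hypothetical file names:
# with open("wrapped.fastq") as src, open("single_line.fastq", "w") as dst:
#     dst.writelines(to_single_line(src))
```

The bare "+" separator written on output drops the optional repeat of the read ID on the third line, which keeps the output unambiguous.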

How long are your reads? If they are longer than 300-400 b, you may want to try compiling STAR with 'make STARlong'.
Also, I typically use the following parameters for the longer reads:
--outFilterMismatchNmax 100   --seedSearchLmax 20   --seedSearchStartLmax 20   --seedPerReadNmax 100000   --seedPerWindowNmax 100   --alignTranscriptsPerReadNmax 100000   --alignTranscriptsPerWindowNmax 10000

Cheers
Alex

Xiujun Zhang

Aug 25, 2014, 1:46:28 AM
to rna-...@googlegroups.com
Hi Andrew,
Have you resolved the problem? I am running into the same error:
ERROR_0001: EXITING because of FATAL ERROR in input reads: unknown file format: the read ID should start with @ or >

XJ
xj.z...@ntu.edu.sg

On Thursday, August 1, 2013 at 6:48:40 AM UTC+8, Alexander Dobin wrote:

Alexander Dobin

Aug 26, 2014, 12:07:33 PM
to rna-...@googlegroups.com
Hi XJ,

Are your files gzipped? If so, you need to use --readFilesCommand zcat
If this does not solve the problem, something is wrong with your read formatting. Have you trimmed your reads? This sometimes screws with the formatting.
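
A quick way to check both possibilities before rerunning STAR is sketched below in Python. The helper name `sniff_reads_file` is hypothetical; it relies only on the standard gzip magic bytes (1f 8b) and on the rule from the error message that the first character of the decompressed file must be @ or >.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_reads_file(path):
    """Report whether `path` is gzipped and what its first decoded character is."""
    with open(path, "rb") as fh:
        gzipped = fh.read(2) == GZIP_MAGIC
    # Read the first character through gzip if needed, otherwise directly.
    opener = gzip.open if gzipped else open
    with opener(path, "rb") as fh:
        first = fh.read(1).decode("latin-1")
    return {
        "gzipped": gzipped,
        "first_char": first,
        "looks_valid": first in ("@", ">"),
    }
```

If `gzipped` is true, the file needs --readFilesCommand zcat; if `looks_valid` is false even after decompression, the formatting itself is the more likely culprit.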

Cheers
Alex

jxz...@case.edu

Oct 6, 2014, 4:55:19 PM
to rna-...@googlegroups.com
Hi Alex,
I ran into the same error message but I did invoke the --readFilesCommand zcat.
I also checked the files with zless and it seemed that the sequences did start with @.  The first few lines looked like this.
@HWI-ST152R_0409:5:1:1451:1993#NAGCTT/1
GCTGTATCTCTCAGGATTATCACTGATCACACATCCAACCAGTGCCAGCCAAAAGGATGCCCTGAGGCAAAGGGT
+HWI-ST152R_0409:5:1:1451:1993#NAGCTT/1
bbd_dee`dcdfefffffeffffffaeeeefdfffffefffeeefeffffffefcffcefffeffadfedbbTcU
@HWI-ST152R_0409:5:1:1587:1992#TAGCTT/1
GGCCATCTGATCTATAAATGCGGTGGCATCGACAAAAGAACCATTGAAAAATTTGAGAAGGAGGCTGCTGAGATG
+HWI-ST152R_0409:5:1:1587:1992#TAGCTT/1

Following is copied from the Log.out file of the run.
##### Final parameters after user input--------------------------------:
versionSTAR                       20201
versionGenome                     20101   20200
parametersFiles                   -
runMode                           alignReads
runThreadN                        3
genomeDir                         /data/STARgenomes/ENSEMBL.homo_sapiens.release-75
genomeLoad                        NoSharedMemory
genomeFastaFiles                  -
genomeSAindexNbases               14
genomeChrBinNbits                 18
genomeSAsparseD                   1
readFilesIn                       /home/zhaoj2/LiXLab/genentech/rnaseqdata/587352_1_1.fastq.gz   /home/zhaoj2/LiXLab/genentech/rnaseqdata/587352_1_2.fastq.gz
readFilesCommand                  zcat
Nothing else was re-defined.
What do you think might be the problem?

Thanks a lot.
J.Z

Alexander Dobin

Oct 13, 2014, 12:06:30 PM
to rna-...@googlegroups.com
Hi Junjie,

Could you please try to unzip a portion of your file and see if STAR can map it (without --readFilesCommand zcat, of course)?
Then re-zip it and try to map again. If one of these operations fails, please send me the smallest fastq where you can still see this error, and also your Log.out file.
Also, switching to the latest STAR patch might be a good idea (https://github.com/alexdobin/STAR/releases).
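
The "unzip a portion" step can be sketched in Python as follows. The function name `extract_head` and the file names in the usage comment are hypothetical, and the sketch assumes the usual four lines per FASTQ record.

```python
import gzip
from itertools import islice


def extract_head(src_gz, dst_plain, n_records, lines_per_record=4):
    """Copy the first n_records of a gzipped FASTQ into an uncompressed file.

    Returns the number of lines written.
    """
    written = 0
    with gzip.open(src_gz, "rt") as src, open(dst_plain, "w") as dst:
        for line in islice(src, n_records * lines_per_record):
            dst.write(line)
            written += 1
    return written


# Usage with hypothetical file names:
# extract_head("reads_1.fastq.gz", "reads_1.head.fastq", n_records=100000)
```

Halving `n_records` while the error persists is one way to arrive at the smallest file that still reproduces it.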

Cheers
Alex

Rory Kirchner

Oct 20, 2014, 10:36:51 AM
to rna-...@googlegroups.com
Hi Alex,

We've encountered the same issue; I attached a small two-read fastq file that reproduces the error. I think it is caused by read IDs that have spaces in the names, because if I replace the spaces with dashes, it works okay.
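
For anyone stuck on an older STAR version, the space-to-dash workaround could look roughly like this in Python. The function name `dash_spaces_in_ids` is made up, and the sketch assumes a single-line (four lines per record) FASTQ, so that only the ID lines are touched and quality strings that happen to start with "@" are left alone.

```python
def dash_spaces_in_ids(lines):
    """Replace spaces with dashes on the '@' and '+' lines of a 4-line FASTQ."""
    for i, line in enumerate(lines):
        if i % 4 in (0, 2):  # line 1 (read ID) and line 3 ('+' line) of each record
            body = line.rstrip("\r\n")
            # Keep the original line ending, change only the ID text.
            yield body.replace(" ", "-") + line[len(body):]
        else:
            yield line


# Usage with hypothetical file names:
# with open("broken.fq") as src, open("fixed.fq", "w") as dst:
#     dst.writelines(dash_spaces_in_ids(src))
```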

Best,

Rory


On Wednesday, July 31, 2013 9:09:32 AM UTC-4, Andrew Jaffe wrote:
broken.fq

Rory Kirchner

Oct 20, 2014, 10:50:52 AM
to rna-...@googlegroups.com
This is fixed in the 2.4.0d release, you are psychic. 


On Wednesday, July 31, 2013 9:09:32 AM UTC-4, Andrew Jaffe wrote:

Susanne Lorenz

Oct 21, 2014, 9:00:43 AM
to rna-...@googlegroups.com
Hi,

I have a similar problem with the fastq files. I have fastq files with very long reads from MinION (Oxford Nanopore), and STAR gives an error message saying the quality string length is not equal to the sequence length, but it is.
I have checked the fastq file by running BWA and this works, except of course it can't map the long reads. This is the message I get:

Oct 21 11:32:32 ..... Started STAR run

Oct 21 11:33:27 ..... Started mapping

EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length

@channel_121_read_99

TTATTCAGAACTTCTGAAATAGAATTTCTGCCCTGCGACTCCAGTAGAACGCAGGCACAGGCATAGTGTAGCGCTTTATATGCAATGTCCTCGTTGTATGATTGGCTGAGGTAAGGATGGATTCCGAAAACGCATGTAAGAGGACAGACAGTGTCAGGTCTTGTCTTAGGGCTTCAACACGTACACAGGAAAGCCATTTCCGGTATGTAGTTGACGGAAATCGCCGGTGCACGATCGGGAGCGAACTGATCCTGACAAACCCCACTGCCCGTTTCATTTCCTTTCGTCTTAATATTTACGTCAGATCAGCAGCCGACTCTGGCTCAATTTTGCATTGTCGACAGAGGCCAGCCATACTCAACCGATGAGCCTGTCATGTCCCAAGATTCTTATCAGCTGTCGTCTGTCGGCTCCGATTCTTACTGCGTACTATCACTTCTGCTTGTTTCCGAAGTATCCTCCATCACCGTCCCCCAGGCAATCCTTCGAGCTGGCGCGCA

SOLUTION: fix your fastq file

Oct 21 11:33:27 ...... FATAL ERROR, exiting

Does anyone have an idea how to fix that issue?


Alexander Dobin

Oct 22, 2014, 2:58:08 PM
to rna-...@googlegroups.com
Hi Susanne,

First of all, there is no hope that STAR in its present shape will map Oxford Nanopore reads: the error rate is just too high.
In general, if you have reads longer than ~300 b, you need to compile STAR with "make STARlong".
Also, STAR does not work with multi-line FASTQ (i.e. files where sequences are split over many lines). You would need to convert it to FASTA, which can be multi-line, or to single-line FASTQ.
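
To see how a file looks when it is read strictly as four lines per record, a small Python check along these lines may help. This is only a sketch of such a reading, not STAR's actual parser, and the name `find_fastq_problems` is hypothetical. A wrapped (multi-line) file typically fails on the first record, because the second sequence line lands where the "+" line is expected.

```python
def find_fastq_problems(lines, max_problems=10):
    """Return (record_number, message) pairs for a strict 4-line FASTQ reading."""
    problems = []
    record = []
    n = 0
    for raw in lines:
        record.append(raw.rstrip("\r\n"))
        if len(record) < 4:
            continue
        n += 1
        header, seq, plus, qual = record
        record = []
        if not header.startswith("@"):
            problems.append((n, "read ID does not start with @"))
        elif not plus.startswith("+"):
            problems.append((n, "third line does not start with +"))
        elif len(seq) != len(qual):
            problems.append(
                (n, f"sequence length {len(seq)} != quality length {len(qual)}")
            )
        if len(problems) >= max_problems:
            break
    else:
        # Only reached when the whole input was consumed.
        if record:
            problems.append((n + 1, "incomplete record at end of file"))
    return problems
```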

Cheers
Alex

Daren Card

Oct 13, 2015, 12:51:47 PM
to rna-star
Hi all,

I am having the same issue with fastq input file formatting as is outlined above. I am working with STAR v. 2.4.2a. The error I get is as follows:

"ERROR_00201: EXITING because of FATAL ERROR in input reads: unknown file format: the read ID should start with @ or >"

Some background:
1. My reads are all pulled down from the SRA and converted into fastq.
2. I've tried running STAR with both gzipped and bzipped input, but all efforts reported below were with unzipped fastq files.

The raw SRA reads look like this. You can see that there is no read/quality-score wrapping, and the headers are standard for SRA fastq (based on Wikipedia).

@SRR805129.1 HWI-ST485:135522712:C1T0BACXX:1:1101:1210:2112 length=100
CAGCTGCAGACCCAGATGATGGAAGAAGGGGAAAAGGTGAAGGAAAAACTCAAGAGGGAGCTGGAGCAGCTGCAGGCAGATGTTGCTCCCTTTCTGGTGG
+SRR805129.1 HWI-ST485:135522712:C1T0BACXX:1:1101:1210:2112 length=100
@BCDFFFFHFHHHJIDIJEHIJGGHAGHIJJFGGIJJHHGJIJIIJIGIHIIJJJJJJIJHHHHFFFFEEEEACDBDDDDDDDCDCDCDDDDDCDCCC?B
@SRR805129.2 HWI-ST485:135522712:C1T0BACXX:1:1101:1190:2137 length=100
CTCCAATGGCGACCCAAAATGCCAAAGTGATTCCAGTAGCAAGACCACCAAAGGCACCTTTCCAATTGGCACAAGGAAAAATAATTCACAGTGTAAATAC
+SRR805129.2 HWI-ST485:135522712:C1T0BACXX:1:1101:1190:2137 length=100
???D=D>3B?C:FG1C;<;?GEEHD>>?@4??990:09BBB>>;)?88;B(6;C9=D()=EHEECED;;).;;AC=(;>=955@################
@SRR805129.3 HWI-ST485:135522712:C1T0BACXX:1:1101:1106:2179 length=100
TCACTCAGAATTGAGTTTTTGTTATGGTTTGATTAAGTGTGTATCCTGTAAATAATGGGAATCAGTGTGTTAGTCCCCCTATGATGGCAAAGACGGCCCC
+SRR805129.3 HWI-ST485:135522712:C1T0BACXX:1:1101:1106:2179 length=100
@@@DDFDFBFFHHHHFHIJJIHJAHHIIIIH>GHEEHGHIGHGIGJIGGFHGICGGGIIIIJFHGIIGIHJIJJDGHIFHECDEFCDEDACC>=@B88?@

I've quality trimmed the reads using Trimmomatic, which results in the same header format. Some reads are completely excluded and some trimmed to a shorter length. The length variable in the headers is no longer accurate in some instances, but given the error above this does not seem to be the issue. These quality-trimmed reads produced my initial instance of the above error. I've since investigated whether characters in the headers produced these issues and removed any spaces, equal signs, and periods and replaced them with underscores or dashes. Still receiving the same errors. The last set of reads I tried looked like this:

@SRR805129-1_HWI-ST485:135522712:C1T0BACXX:1:1101:1210:2112_length-100
CAGCTGCAGACCCAGATGATGGAAGAAGGGGAAAAGGTGAAGGAAAAACTCAAGAGGGAGCTGGAGCAGCTGCAGGCAGATGTTGCTCCCTTTCTGGTGG
+SRR805129-1_HWI-ST485:135522712:C1T0BACXX:1:1101:1210:2112_length-100
@BCDFFFFHFHHHJIDIJEHIJGGHAGHIJJFGGIJJHHGJIJIIJIGIHIIJJJJJJIJHHHHFFFFEEEEACDBDDDDDDDCDCDCDDDDDCDCCC?B
@SRR805129-3_HWI-ST485:135522712:C1T0BACXX:1:1101:1106:2179_length-100
TCACTCAGAATTGAGTTTTTGTTATGGTTTGATTAAGTGTGTATCCTGTAAATAATGGGAATCAGTGTGTTAGTCCCCCTATGATGGCAAAGACGGCCCC
+SRR805129-3_HWI-ST485:135522712:C1T0BACXX:1:1101:1106:2179_length-100
@@@DDFDFBFFHHHHFHIJJIHJAHHIIIIH>GHEEHGHIGHGIGJIGGFHGICGGGIIIIJFHGIIGIHJIJJDGHIFHECDEFCDEDACC>=@B88?@

I'm pretty stumped, so if anyone sees something I am missing, please let me know. I appreciate any help people can provide.

Thanks,
Daren

Alexander Dobin

Oct 14, 2015, 7:04:37 PM
to rna-star
Hi Daren,

The reads you have posted here look OK, but the problematic reads might be somewhere farther from the beginning.
Please send me the Log.out file. Also, could you please try the following:
1. Map without trimming the reads. Trimming often creates problems.
2. Try to map a subset of reads with --readMapNumber option. You can use binary search to find the offending read. :)
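
The binary search in step 2 can be sketched in Python as below. `first_failing_read` is generic and only needs a yes/no answer for "does mapping the first n reads fail?". The `star_fails` probe is a hypothetical example of such an answer: its default paths are placeholders, it uses only options that appear in this thread, and it assumes STAR exits with a non-zero status on a fatal error.

```python
import subprocess


def first_failing_read(total_reads, fails):
    """Smallest n such that mapping the first n reads fails, or None.

    Assumes that once the offending read is included, every larger
    subset fails as well.
    """
    if not fails(total_reads):
        return None
    lo, hi = 1, total_reads
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # offending read is among the first mid reads
        else:
            lo = mid + 1  # offending read comes after mid
    return lo


def star_fails(n, genome_dir="genomeDir", reads="trimmed.fastq"):
    """Hypothetical probe: run STAR on the first n reads and report failure."""
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", reads,
        "--readMapNumber", str(n),
    ]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode != 0


# Usage, with the total read count supplied by you:
# offending = first_failing_read(total_reads, star_fails)
```

Each probe is a full STAR start-up, so the number of runs grows only with the logarithm of the read count.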

Cheers
Alex

Daren Card

Nov 18, 2015, 5:38:27 PM
to rna-star, do...@cshl.edu
Hi Alex,

Thanks for the reply. Finally had a chance to get back to this. I've attached the Log.out file to this email.

1. I mapped the raw reads from the same sample, so it is definitely an issue with the read trimming.

2. I'll work on a binary search for the problematic read.

Let me know if you have any further guidance based on the Log file.

Thank you,
Daren
SRR805129_LowerDigestiveTractLog.out

Alexander Dobin

Nov 19, 2015, 5:02:08 PM
to rna-star, do...@cshl.edu
Hi Daren,

The Log.out looks all right, no hint of the problem there. Most likely, it's an empty sequence produced by the trimming.
Do you require a minimum length after trimming?
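
To check whether trimming left empty or very short reads behind, something like the following Python sketch could be used. The function names are made up, and it assumes four lines per record. Trimmomatic's MINLEN step is the usual way to enforce a minimum length at trimming time; this is just an after-the-fact check and filter.

```python
def records(lines):
    """Group a 4-line FASTQ into (header, seq, plus, qual) tuples."""
    rec = []
    for raw in lines:
        rec.append(raw.rstrip("\r\n"))
        if len(rec) == 4:
            yield tuple(rec)
            rec = []


def count_short_reads(lines, min_len=1):
    """Count reads whose sequence is shorter than min_len (default: empty)."""
    return sum(1 for _, seq, _, _ in records(lines) if len(seq) < min_len)


def keep_long_reads(lines, min_len=1):
    """Yield only the records whose sequence has at least min_len bases."""
    for header, seq, plus, qual in records(lines):
        if len(seq) >= min_len:
            yield f"{header}\n{seq}\n{plus}\n{qual}\n"
```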

Cheers
Alex

Trevor Conley

Oct 11, 2016, 11:40:21 AM
to rna-star
Hello Alex,
   I recently started receiving this error. What is interesting about it is that while my files are zipped fastqs, if I add in the command --readFilesCommand zcat, the processing does not even start. 

   I am working off of a script that someone else wrote years ago and they wrote --readFilesCommand -dc and that too does not work. 

   If I removed either of those from the command, the processing starts and then I receive the error. Any idea what could be causing this? I do not perform any trimming on the sequence; I am taking the straight fastq.gz file that is provided to me from the people who perform the sequencing. 

Thank you,
Trevor

Alexander Dobin

Oct 14, 2016, 6:47:40 PM
to rna-star
Hi Trevor,

please send me the Log.out file.
The --readFilesCommand uses the "fifo" files and some storage partitions (e.g. VFAT) do not allow fifo files.
Please try process substitution, i.e. --readFilesIn <(zcat read1.fq.gz) <(zcat read2.fq.gz) and no --readFilesCommand.
If this works, it would mean that the problem is with fifo files.

Cheers
Alex

Trevor Conley

Nov 29, 2016, 4:38:42 PM
to rna-star
Hello Alex,
   So far this solution is working. Could ordinary software updates have caused the problem in the first place? I have run the same command numerous times without ever having a problem, and then the problem suddenly arose.

Thanks,
Trevor

Alexander Dobin

Dec 5, 2016, 4:08:44 PM
to rna-star
Hi Trevor,

If this problem is caused by "fifo" files, it should not be affected by software updates, but rather by the properties of the partition that you run your STAR jobs on.
Could you try the following from within one of the STAR run directories (where STAR fails to run):

$ mkfifo test

Does it throw an error?
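
The same test can be scripted, for example to check several run directories at once. Below is a minimal Python sketch using the standard `os.mkfifo` call; the function name `supports_fifo` is hypothetical, and it treats any OS-level error (including a missing or read-only directory) as "no".

```python
import os


def supports_fifo(directory):
    """Return True if a named pipe (fifo) can be created in `directory`."""
    path = os.path.join(directory, f".fifo_test_{os.getpid()}")
    try:
        os.mkfifo(path)
    except OSError:
        # e.g. the file system does not allow fifo files, or the
        # directory is missing or not writable
        return False
    os.remove(path)  # clean up the test pipe
    return True


# Usage with a hypothetical run directory:
# print(supports_fifo("/path/to/star_run_dir"))
```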

Cheers
Alex