Zipped FastQ Files


Anthony Colombo

Jan 21, 2014, 4:15:41 PM
to rna-...@googlegroups.com
Hello.

I have zipped FASTQ files, but I read in another thread that STAR requires either multi-line FASTA or single-line FASTQ files, so I am still in the process of converting them.

However, does STAR accept zipped FASTA files, assuming I convert the FASTQ files correctly?

Thank you
AC

Santosh Anand

Jan 21, 2014, 4:34:02 PM
to rna-...@googlegroups.com
STAR can read from any standard compressed file, as long as the "readFilesCommand" parameter is set correctly at run time. From the STAR manual:

readFilesCommand:

string(s): command line to execute for each of the input file. This command should generate FASTA or FASTQ text and send it to stdout

For example: zcat - to uncompress .gz files, bzcat - to uncompress .bz2 files, etc.
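
A minimal sketch of what this parameter relies on, assuming a Linux-like shell with gzip available. The read, the file name demo.fastq.gz and the genome path are made up for illustration, and the STAR call is only printed, not executed. The point is that the command named in readFilesCommand has to print plain FASTQ (or FASTA) text to stdout:

```shell
#!/bin/sh
set -eu

# Create a tiny gzipped FASTQ file with one made-up read, purely for illustration.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/demo.fastq.gz"

# STAR runs the readFilesCommand on each input file and reads its stdout,
# so the command has to print plain FASTQ (or FASTA) text.
# "gzip -dc" is used here because it behaves like zcat for .gz files.
first_line=$(gzip -dc "$workdir/demo.fastq.gz" | head -n 1)
echo "first decompressed line: $first_line"

# The matching STAR call, shown but not executed (placeholder genome path).
echo "STAR --genomeDir /path/to/genome --readFilesIn $workdir/demo.fastq.gz --readFilesCommand zcat"
```

If the first decompressed line starts with "@" (FASTQ) or ">" (FASTA), the command is producing the kind of text STAR expects.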

Anthony Colombo

Jan 21, 2014, 6:00:23 PM
to Santosh Anand, rna-...@googlegroups.com
I am currently having issues with the multi-line FASTQ files, so I cannot test this yet. Just to clarify: is "readFilesCommand" given on the command line in the script when the job is first submitted?

I submit jobs to HPCC and am asking whether I just pass the command as a parameter.

Would a sample command be:  --readFilesCommand zcat

My input for the reads is:  --readFilesIn read1.fastq.gz

Would this be the correct argument for the parameter readFilesCommand?

thank you very much




Santosh Anand

Jan 21, 2014, 6:19:05 PM
to rna-...@googlegroups.com, Santosh Anand
> would a sample command be:  --readFilesCommand zcat 
This is correct. "readFilesCommand" is a parameter of STAR (like "readFilesIn")

BTW, what do you mean by "...must have multi line fastA files or single lined FastQ"? STAR should work just fine with FASTQ or FASTA; both formats are standard.

Anthony Colombo

Jan 21, 2014, 10:01:02 PM
to rna-...@googlegroups.com
I am having the same errors as this post.


Mr. Dobin mentioned there that multi-line FASTQ files are a problem for STAR, and he recommends the FASTA format. Am I misunderstanding?

Thank you

AC

Santosh Anand

Jan 22, 2014, 4:44:59 AM
to rna-...@googlegroups.com
Can you post here a few lines of your fastq (head *.fastq)?

Alexander Dobin

Jan 22, 2014, 3:39:39 PM
Hi Anthony,

Santosh's suggestion is right to the point - if you add  --readFilesCommand zcat  to the STAR command line parameters, it will process the zipped FASTQ file.

STAR does not support multi-line FASTQ files. There is a discussion going on whether FASTQ format allows for multi-line reads.
On the other hand, STAR supports multi-line FASTA, so if you convert your multi-line FASTQ into FASTA, it should be OK.

Cheers
Alex

Santosh Anand

Jan 22, 2014, 5:44:42 PM
> STAR does not support multi-line FASTQ files. There is a discussion going on whether FASTQ format allows for multi-line reads.

However, it is not difficult to convert them to the "more standard" single-line FASTQ format using "seqtk":
seqtk seq multi-line.fq > standard-single-line.fq

There are supposedly other tools  which can do this conversion:
(....In the list, seqtk, bioawk and seqret work with multi-line fastq; the rest don't...)
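
Before converting anything, it can help to check whether a file really is multi-line. The following is a rough sketch, not a full FASTQ validator: it assumes the strict four-lines-per-read layout (header starting with "@", sequence, "+" line, quality string of the same length as the sequence). The function name is_four_line_fastq and the two demo files are hypothetical:

```shell
#!/bin/sh
set -eu

# Return 0 if the file follows the strict four-lines-per-read layout, 1 otherwise.
is_four_line_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen { bad = 1; exit }
        END { if (bad || NR % 4 != 0) exit 1 }
    ' "$1"
}

# Two made-up files: one single-line record, one record wrapped over two lines.
workdir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n' > "$workdir/single.fq"
printf '@r1\nACGT\nACGT\n+\nIIII\nIIII\n' > "$workdir/multi.fq"

for f in "$workdir/single.fq" "$workdir/multi.fq"; do
    if is_four_line_fastq "$f"; then
        echo "$(basename "$f"): four-line layout"
    else
        echo "$(basename "$f"): not four-line (possibly multi-line)"
    fi
done
```

A file that fails this check is a candidate for the seqtk conversion shown above; a file that passes is already in single-line form.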




Anthony Colombo

Jan 22, 2014, 9:02:06 PM
to Santosh Anand, rna-...@googlegroups.com
Sorry it took so long to reply.  Here is the head of my file.

I assume this is the multi-line FASTQ file that is problematic for STAR; is this correct? Do I need to do the FASTQ conversion?

@SN860:381:H80WNADXX:1:1101:1496:1970 1:N:0:3
TTCCATCTTGTGATCCATTCTTGTGCATTCTTCACTTCTTGAGTCACTCCCAAAATCCATTTGTATTGTTACTCCTCGACCAAAAAGGACCAGAACAAAAAGTTTACTTCAATTGTTCCCATAGGAAACTCAG
+
FFFIIIIIIIFIIIIIIIIIIIFFFFIIIIIIIFFIFFFIFFFBFBFFFIFIIIFFIFBFFFIBFFFBFFBFBFFBFFFFFFFFFBFB<BBBBBBBBFFFFBFFBB0<<BB<B<BBBBFBBBBB<BB<7<0<0
@SN860:381:H80WNADXX:1:1101:1675:1966 1:N:0:3
GGTGATGTACTCCACGTAAGCGATGGCATCTTCCACACGGCGCTTCTTTGTACAGAGGATGATGCGGTTGAAGTTGGCGTAGATGACTGATGACCCCAGGCGCTTGAACTCAGCGATGAGCTGCAGGAAGAG
+
BFBFFFFIIIIIIFIIFFFIBFFFFIIIBBFBFFFBFFF<7BBFFFFFFF<BB<BBFB<<BBBBBBBFFBB<BBFFF<<07<7<0<<BBBBBBBBBB7BBB<<BBBBBBBBBBB7007<00070<BBB7700
@SN860:381:H80WNADXX:1:1101:2217:1967 1:N:0:3
CTGTCATAATCTTCTTGTCCAGCTGTATCCCATAAGCCCAGATTCACCGGTTTTCCATCTACCATAACATTGGCAGAATAATTGTCAAAGACAGTAGGGATATATTCTCCAGGAAATGC


Anthony Colombo
University of Southern California
Applied Mathematics (B.S.)  and Physics (Minor)
Fall '14


Santosh Anand

Jan 23, 2014, 4:37:44 AM
to rna-...@googlegroups.com, Santosh Anand
As far as multi-line is concerned, this FASTQ is correct, since neither the bases nor the quality scores are spread over multiple lines. So you can just try feeding it to STAR. If it gives an error, you can try seqtk:

seqtk seq multi-line.fq > standard-single-line.fq

seqtk might already be installed on your server, or it can be downloaded from here

Anthony Colombo

Jan 30, 2014, 11:54:07 AM
to Santosh Anand, rna-...@googlegroups.com
Hello.

Thank you in advance for your help. 

Here is the head of my file

[acolombo@hpc-login2 Sample_890129]$ head unzip_890129.fastq
@SN1083:317:H7WHAADXX:1:1101:1605:2247 1:N:0:6
AAATAACCAGCCTTGAGAGGGCGCAGGACCACAGTGTGGGAGACATTGCTAGCAGGGGCAATCCGGTCCCATTTGACATTGAGCATTCCAGACACAATGCCAAAGTCTTCTGGAGGGAAGGAATCATC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBFFFF
@SN1083:317:H7WHAADXX:1:1101:2147:2230 1:N:0:6
TGCGATGAGTAGGGGAAGGGAGCCTACTAGGGTGTAGAATAGGAAGTATGTGCCTGCGTTCAGGCGTTCTGGCTGGTTGCCTCATCGGGTGATGATAGCCAAGGTGGGGATAAGTGTGGTTTCGNAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBFFBBFFFFFFFFFFFFF<BFFFFFFFFFFFFFFFFFF#07B
@SN1083:317:H7WHAADXX:1:1101:2520:2248 1:N:0:6
GTTGCCTGGCTGGCCCAGCTCGGCTCGAATAAGGAGGCTTAGAGCTGTGCCTAGGACTCCAGCTCATGCGCCGAATAATAGGTATAGTGTTCCAATGTCTTTGTGGTTTGTAGAGAATAGTCAGATCG




Here is the alignment output:



---------------------------------------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Jan 28 13:42:31 ..... Started STAR run
Jan 28 13:48:40 ..... Started mapping

EXITING because of FATAL ERROR in input reads: unknown file format: the read ID should start with @ or > 

Jan 28 13:48:40 ...... FATAL ERROR, exiting



I am not sure why this error occurs if my files are in the correct format. I will download seqtk, but I would like to understand why.

Thank you

AC

Anthony Colombo

Jan 30, 2014, 12:01:35 PM
to Santosh Anand, rna-...@googlegroups.com
Sorry to slam the inbox. Here are the alignment parameters.



Log.out

Alexander Dobin

Jan 31, 2014, 11:25:51 AM
to rna-...@googlegroups.com, Santosh Anand
Hi Anthony,

The beginning of your file looks all right; however, the problem might be somewhere in the middle. Your run parameters look all right. Can you try to run STAR on the raw data, without any trimming? If that goes through, it will point to a problem with the trimming.

Cheers
Alex
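
Since head only shows the first few records, one way to look further into the file is to scan every header line. This is a hedged sketch that assumes a gzipped FASTQ in the four-lines-per-read layout; the function name first_bad_header and the demo file (whose second record is missing its header) are hypothetical:

```shell
#!/bin/sh
set -eu

# Print the line number and content of the first header line (lines 1, 5, 9, ...)
# that does not start with "@"; print nothing if all headers look fine.
first_bad_header() {
    gzip -dc "$1" | awk 'NR % 4 == 1 && substr($0, 1, 1) != "@" { print NR ": " $0; exit }'
}

# Made-up example: the second record has lost its header line.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\nTTTT\n+\nIIII\n' | gzip -c > "$workdir/trimmed.fastq.gz"

bad=$(first_bad_header "$workdir/trimmed.fastq.gz")
echo "first suspicious header line: ${bad:-none}"
```

The reported line number gives a place to start inspecting the file by hand. For an uncompressed file, the gzip -dc step would be replaced by cat.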

Anthony Colombo

Mar 25, 2014, 1:49:53 AM
to rna-...@googlegroups.com

Anthony Colombo

Mar 25, 2014, 2:06:17 AM
to rna-...@googlegroups.com, Santosh Anand, ado...@gmail.com
Hello. I am sorry for the late reply, but I have now returned to this problem after a detour into other problems.

Long story short, this is what I have done to work around the issue. Note: I am getting the same error message when running STAR on reads trimmed with the Cutadapt software.

What I've done:
1) Unzipped the raw reads and ran STAR on the unzipped raw reads.

I did a test run and it has been running for over ten minutes, whereas the original run errored out after a few seconds according to the Log file, so STAR appears to be running correctly on the unzipped raw reads. Note: I also tried the zipped raw reads but got the same error, and yes, I used --readFilesCommand zcat.

So I am seeking general guidance on what to look for when inspecting the adapter-trimming step. How do I look for hidden inserted characters? Should I run seqtk to convert the FASTQ to FASTA to test for data corruption?

My guess is that I picked up a bug somewhere in the trimming process. What trimming software do you recommend?

My data trimmed by Trimmomatic and mapped with TopHat2 works fine. However, I need to use Cutadapt and STAR for testing.

Thank you very much

Anthony C
Log.out

Anthony Colombo

Mar 25, 2014, 3:40:21 AM
to rna-...@googlegroups.com
Yes, I am now running my trimmed results, and I find that STAR gives errors for the zipped data but not for the unzipped FASTQ files.

Here is my command line:

/auto/rcf-proj/sa1/software/STAR_2.3.1v/STAR --genomeDir /auto/rcf-proj/sa1/data/STAR/hg19/Genome_With_Annotations --runMode alignReads --sjdbGTFfile /auto/rcf-proj/sa1/data/Homo_sapiens1/UCSC/hg19/Annotation/Archives/archive-2013-03-06-11-23-03/Genes/genes.gtf --runThreadN 16 --readFilesIn trim_Sample_CHLA-15_R1.fastq.gz trim_Sample_CHLA-15_R1.fastq.gz --readFilesCommand zcat --genomeChrBinNbits 18 --genomeSAindexNbases 14 --genomeSAsparseD 1 --readMatesLengthsIn NotEqual --outReadsUnmapped Fastx --outSAMmode Full

What is the best command line for processing zipped files?



Alexander Dobin

Mar 25, 2014, 5:26:07 PM
to rna-...@googlegroups.com
Hi Anthony,

Your command seems to be OK for processing zipped files. Could you send me these files? I am afraid this is the only way to figure out what is going on.
Some users have reported problems with trimmed files in cases where reads were trimmed to zero length. However, I cannot see how this would be affected by zipping.

Cheers
Alex
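
Whether any reads were trimmed to zero length can be checked directly. The sketch below assumes a gzipped FASTQ in the four-lines-per-read layout and simply counts empty sequence lines; the function name count_empty_reads and the demo file are hypothetical:

```shell
#!/bin/sh
set -eu

# Count reads whose sequence line (lines 2, 6, 10, ...) is empty.
count_empty_reads() {
    gzip -dc "$1" | awk 'NR % 4 == 2 && length($0) == 0 { n++ } END { print n + 0 }'
}

# Made-up example: the second read has been trimmed down to nothing.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\n\n+\n\n' | gzip -c > "$workdir/trimmed.fastq.gz"

empty=$(count_empty_reads "$workdir/trimmed.fastq.gz")
echo "zero-length reads: $empty"
```

A count of 0 would rule out zero-length reads as the cause, at least under the layout assumption stated above.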

Anthony Colombo

Mar 26, 2014, 3:55:08 PM
to Alexander Dobin, rna-...@googlegroups.com
Thank you for the reply.

Would you like the actual reads (FASTQ files) or the log files?

Also, I ran a cutadapt trim with a minimum length of 10 bp, so I highly doubt that any 0 bp reads exist (this would need to be confirmed).



Anthony Colombo
University of Southern California
Applied Mathematics (B.S.)  and Physics (Minor)
Fall '14

Anthony Colombo

Mar 26, 2014, 4:09:48 PM
to Alexander Dobin, rna-...@googlegroups.com
Here are the alignment files.  I am not sure if you need the actual reads, and am not sure how to send a large .gz file.  I can send you a dropbox link if needed for the real data.



Aligned.out.sam
Log.out
Log.progress.out

Alexander Dobin

Mar 28, 2014, 3:39:14 PM
to rna-...@googlegroups.com, Alexander Dobin
Hi Anthony,

I cannot see any explanation for the error in these output files.
It appears that the problematic lines are very close to the beginning of your fastq.gz files, since no alignments are written into Aligned.out.sam file.
If the fastq.gz files are too big to transfer, you can cut out the first few hundred thousand lines, gzip them, and try to run STAR on them.
If you see the same error, you can send me these reduced fastq.gz files, as well as the Log.out file.

Cheers
Alex
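
Cutting out the first part of a fastq.gz file could look like the sketch below. It assumes four lines per read, so the number of lines kept is always a multiple of four and no record is cut in half. The function name subset_fastq_gz and the file names are hypothetical:

```shell
#!/bin/sh
set -eu

# Keep the first N reads (4 lines each) of a gzipped FASTQ and re-compress them.
# $1 = input .fastq.gz, $2 = output .fastq.gz, $3 = number of reads to keep
subset_fastq_gz() {
    gzip -dc "$1" | head -n "$(( $3 * 4 ))" | gzip -c > "$2"
}

# Made-up input with three reads; keep the first two.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' \
    | gzip -c > "$workdir/full_R1.fastq.gz"

subset_fastq_gz "$workdir/full_R1.fastq.gz" "$workdir/reduced_R1.fastq.gz" 2
lines=$(gzip -dc "$workdir/reduced_R1.fastq.gz" | wc -l | tr -d ' ')
echo "lines in reduced file: $lines"
```

For paired-end data, the same number of reads would be taken from both mate files so that the pairs stay in sync.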

Anthony Colombo

Mar 28, 2014, 6:42:22 PM
to Alexander Dobin, rna-...@googlegroups.com
STAR_reduced_R1.fastq.gz
STAR_reduced_R2.fastq.gz
Log.out
STAR_map.pbs.o7722621

Alexander Dobin

Mar 31, 2014, 7:14:20 PM
to rna-...@googlegroups.com, Alexander Dobin
Hi Anthony,

I am sorry, I should have seen this problem in your Log files, but it slipped my mind completely. For gzipped files, you have to use the --readFilesCommand zcat option; otherwise STAR assumes uncompressed files.
I have run STAR on your files with this option and it completed successfully. 

Please let me know if it worked on the full file.
Cheers
Alex
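
A gzip stream begins with the bytes 0x1f 0x8b rather than "@" or ">", which would be consistent with the "read ID should start with @ or >" error when a .gz file is read as plain text. The sketch below checks those two bytes to decide whether the option is needed. The function name is_gzipped and the demo files are hypothetical, and the STAR calls are only printed, not executed:

```shell
#!/bin/sh
set -eu

# Return 0 if the file starts with the gzip magic bytes 1f 8b.
is_gzipped() {
    [ "$(od -An -tx1 -N2 "$1" | tr -d '[:space:]')" = "1f8b" ]
}

# Made-up plain and gzipped copies of the same one-read FASTQ file.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/plain.fastq"
gzip -c "$workdir/plain.fastq" > "$workdir/packed.fastq.gz"

for f in "$workdir/plain.fastq" "$workdir/packed.fastq.gz"; do
    if is_gzipped "$f"; then
        extra="--readFilesCommand zcat"
    else
        extra=""
    fi
    # Printed only; other STAR parameters (genome directory etc.) are omitted here.
    echo "STAR --readFilesIn $f $extra"
done
```

Checking the content instead of the file extension also catches files that were renamed or compressed without the usual .gz suffix.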

Akhil Pampana

Nov 15, 2016, 11:35:48 AM
to rna-star, ado...@gmail.com
Hey Alex, why is it taking only zipped files as input and not plain FASTQ files?

Alexander Dobin

Nov 17, 2016, 3:14:45 PM
to rna-star, ado...@gmail.com
Hi Akhil,

STAR takes unzipped files by default, if you do *not* use the --readFilesCommand zcat option.

Cheers
Alex