STAR generated BAM file with cufflinks

369 views
Skip to first unread message

Aarthi Mohan

unread,
May 23, 2016, 11:58:58 AM5/23/16
to rna-star
Hi all,

I have strand-specific (fr-firststrand) RNA-seq data. 

I am using STAR generated BAM file with cufflinks to find novel genes. I have used the intronMotif options as suggested in the manual, and still cufflinks won't detect my BAM as paired-end, and raising this warning below.

Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.

The gene.tracking.fpkm files does not report all genes. 

Whereas with the tophat generated BAM file, cufflinks is able to detect the paired-end and computes the fragment size from the mates. 

command used: 

STAR --genomeDir Mapping_tools/STAR/v9.0/ --readFilesIn R1.fq.gz R2.fq.gz --readFilesCommand zcat --sjdbGTFfile Genome/v9.0/pf9.0.gtf --sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript gene_id --outFilterMismatchNmax 5 --runThreadN 8 --outSAMattributes All --outSAMstrandField intronMotif --outFileNamePrefix Ring1_STAR --outSAMmode Full --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within


cufflinks -o tophat_cufflinks -g Genome/v9.0/pf9.0.gtf --library-type fr-firststrand -p 16 accepted_hits.bam cufflinks -o STAR_cufflinks -g Genome/v9.0/pf9.0.gtf --library-type fr-firststrand -p 16 Ring1_STARAligned.sortedByCoord.out.bam


Any clue as to why this happens? Could anyone successfully used a paired-end BAM from STAR with cufflinks share the exact commands?


Thanks,

Aarthi




Alexander Dobin

unread,
May 23, 2016, 4:41:50 PM5/23/16
to rna-star
Hi Aarthi,

Please try to omit --outSAMattributes All --outSAMstrandField intronMotif options.
The first one outputs jI and jM that used to interfere with Cufflinks, at least for the older versions.
The 2nd option is not needed since your data is stranded.
Otherwise your commands look OK to me - I have used similar commands with STAR and Cufflinks and it worked fine.

If this does not help, I will try to reproduce the problem, please send me Log.out and Log.final.out files.

Cheers
Alex

Aarthi Mohan

unread,
May 24, 2016, 6:13:24 AM5/24/16
to rna-star
Hi Alex,

Thanks for the help. 

I re-run STAR command by omitting the --outSAMattributes All --outSAMstrandField intronMotif , and cufflinks still cannot detect the paired-end library. 'genes.tracking.fpkm' file has only few genes. 

I am attaching the Log.out and Log.final.out files here.  I have cufflinks version v2.2.1


Thanks,
Aarthi
Ring1_STAR_alexLog.final.out
Ring1_STAR_alexLog.out

Alexander Dobin

unread,
May 24, 2016, 10:42:52 AM5/24/16
to rna-star
Hi Aarthi,

nothing suspicious in the Log files, I need to have a closer look at this case.
Could you send me the link to the genome (Malaria?)?
Could you share your RNA-seq data with me, or do you know of a similar public dataset?

Cheers
Alex

Aarthi Mohan

unread,
May 26, 2016, 11:42:09 AM5/26/16
to rna-star
Hi Alex,

I am not sure if my reply reached you, since I had replied privately. 

I can share the files with you.
Would you need the FASTQ or the BAM. I will also share the FASTA and GTF.

Thanks,
Aarthi

Alexander Dobin

unread,
May 27, 2016, 4:05:57 PM5/27/16
to rna-star
Hi Aarthi,

thanks for the files!

I could reproduce the problem "Using default Gaussian distribution due to insufficient paired-end reads in open ranges."
Moreover, Cufflinks gets stuck at 99% - this usually happens when too many reads map to a "locus".

I have tried several other tricks (hardclipping soft-clips, end-to-end alignments) to no avail, 

One thing that worked was removing alignments with very long gaps (intronic or on-between mates), e.g. --alignIntronMax 10000 --alignMatesGapMax 20000
Note that maximum annotated intron in the GTF is ~2.5kb, so it makes sense to limit the max gap to 10-20kb.
My guess is that these long gapped reads make Cufflinks merge distant genes together and screw up quantifications.

Also, removing multimappers worked - it got it through the fragment length distribution, though it still got stuck at 99% for quantification.
If you mapped that 1M reads with TopHat, could you check how many reads were mapped uniquely and how many as multimappers?

Another thing that worked even without removing long gaps or multimappers, is -G PlasmoDB9.0_Pfalciparum3D7.gtf  insead of -g. The former option just quantifies the annotated transcripts/genes, while the latter tries to assemble novel transcripts. The -G run does not get stuck and goes all the way without a problem. 

Cheers
Alex

Aarthi Mohan

unread,
May 31, 2016, 9:38:13 AM5/31/16
to Alexander Dobin, rna-star
Hi Alex, 

Thanks so much for a detailed explanation. Earlier, I had tried using a MAPQ 255 filtered reads and had the same cufflinks problem. 

But, the problem goes away after using "alignIntronMax 10000 --alignMatesGapMax 20000". I am able to run cufflinks with both '-G' and '-g' modes with successful paired-end detection. 

Currently, Tophat won't run on my system after I updated samtools to new version, but I will definitely check how the mapping goes with Tophat for the 1M reads and report back.

Thanks again for the help!

Best,
Aarthi

--
You received this message because you are subscribed to a topic in the Google Groups "rna-star" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rna-star/dgN2JVaaavA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rna-star+u...@googlegroups.com.
Visit this group at https://groups.google.com/group/rna-star.

Reply all
Reply to author
Forward
0 new messages