Re: Splicing analysis with STAR followed by cuffdiff

Alexander Dobin

May 16, 2013, 12:30:27 AM

to rna-...@googlegroups.com

Hi Diya,

STAR does not perform differential splicing analysis on its own. However, it can detect alternative splice junctions, both novel and annotated.

Detecting statistically significant differential alternative splicing is a complicated problem, and many approaches besides Cufflinks have been published (DiffSplice, SplicingCompass, DEXSeq). Unfortunately, I do not have any serious experience running them and cannot recommend one over another.

For genome generation, if you are using annotations in .gtf format, you need to pass them with --sjdbGTFfile rather than --sjdbFileChrStartEnd:

--sjdbGTFfile /data1/dbs/iGenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf

I have had some problems with .gtf files downloaded from UCSC: sometimes they have the same transcript IDs for transcripts on different chromosomes. You would need to check for this and possibly add chromosome names to the transcript IDs.
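The check described above can be sketched with awk. This is only an illustration, not a tool from the STAR distribution: the function names `find_multi_chrom_transcripts` and `add_chrom_to_transcript_ids` are hypothetical, the input is assumed to be a standard 9-column tab-separated GTF with a `transcript_id "..."` attribute in column 9, and prefixing every ID with its chromosome (rather than only the duplicated ones) is an assumption about what "add chromosome names" should mean.

```shell
#!/bin/sh
# Hypothetical helpers; assumes a 9-column, tab-separated GTF file.

# Print every transcript ID that appears on more than one chromosome.
find_multi_chrom_transcripts() {
    awk -F '\t' '
        $0 !~ /^#/ && match($9, /transcript_id "[^"]+"/) {
            # Skip the 15 leading characters of the match and the closing quote.
            id = substr($9, RSTART + 15, RLENGTH - 16)
            key = id SUBSEP $1
            if (!(key in seen)) { seen[key] = 1; nchrom[id]++ }
        }
        END { for (id in nchrom) if (nchrom[id] > 1) print id }
    ' "$1" | sort
}

# Write the GTF to stdout with each transcript ID rewritten as <chrom>_<id>.
# Assumption: all IDs are prefixed, not only the ambiguous ones.
add_chrom_to_transcript_ids() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }
        {
            if (match($9, /transcript_id "[^"]+"/)) {
                id = substr($9, RSTART + 15, RLENGTH - 16)
                $9 = substr($9, 1, RSTART - 1) "transcript_id \"" $1 "_" id "\"" substr($9, RSTART + RLENGTH)
            }
            print
        }
    ' "$1"
}
```

For example, `find_multi_chrom_transcripts genes.gtf` lists the affected IDs, and `add_chrom_to_transcript_ids genes.gtf > genes.fixed.gtf` writes a copy with unique IDs. Downstream tools will then report the prefixed IDs, so any ID-based lookups need the same prefix.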

You can also use --runThreadN 3 to speed up genome generation.

For the mapping step, I would recommend using --outFilterType BySJout
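Putting the suggestions above together, the two commands from the original post might be adjusted roughly as follows. This is a sketch, not a tested pipeline: all paths and the remaining options are copied from the original post, the `run` wrapper and the two function names are hypothetical, and it assumes the installed STAR release supports --sjdbGTFfile. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only. Paths are copied from the original post; adjust to your system.
STAR_BIN=/data2/Tools/STAR_2.3.0e.Linux_x86_64/STAR
GENOME_DIR=/data1/dbs/star_genomes_from_igenomes
IGENOMES=/data1/dbs/iGenomes/Homo_sapiens/UCSC/hg19

# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Genome generation: the GTF goes to --sjdbGTFfile, not --sjdbFileChrStartEnd.
generate_genome() {
    run "$STAR_BIN" --runMode genomeGenerate \
        --genomeDir "$GENOME_DIR" \
        --genomeFastaFiles "$IGENOMES/Sequence/Bowtie2Index/genome.fa" \
        --sjdbGTFfile "$IGENOMES/Annotation/Genes/genes.gtf" \
        --sjdbOverhang 99 \
        --runThreadN 3
}

# Mapping: same options as the original post, plus --outFilterType BySJout.
map_reads() {
    run "$STAR_BIN" --genomeDir "$GENOME_DIR" \
        --readFilesIn ../f1_pf.fastq ../f2_pf.fastq \
        --runThreadN 3 \
        --outFileNamePrefix ./star_output_f \
        --outSAMstrandField intronMotif \
        --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
        --outFilterType BySJout
}

generate_genome
map_reads
```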

Cheers

Alex

On Wednesday, May 15, 2013 3:43:20 PM UTC-4, Dedeepya Vaka wrote:

Hi Alex,

You have been quite helpful with the questions we had when we started using STAR. Now that we are interested in splicing analysis, I want to understand how well STAR performs at finding alternative splicing events. Does STAR do something like differential splicing analysis? I am asking because, if we just give the .sam outputs from STAR to cuffdiff for downstream analysis, we see no significant splicing events, so I am not sure whether the output we are loading into cuffdiff from STAR has any information regarding splicing.

Here are the commands we used to generate the genome and do the mapping:

Generating genome:

###############################################################################################################################################################
/data2/Tools/STAR_2.3.0e.Linux_x86_64/STAR --runMode genomeGenerate --genomeDir /data1/dbs/star_genomes_from_igenomes --genomeFastaFiles /data1/dbs/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa --runThreadN 3 --sjdbFileChrStartEnd /data1/dbs/iGenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf --sjdbOverhang 99
###############################################################################################################################################################

Aligning reads to the above generated genome:

###############################################################################################################################################################
/data2/Tools/STAR_2.3.0e.Linux_x86_64/STAR --genomeDir /data1/dbs/star_genomes_from_igenomes/ --readFilesIn ../f1_pf.fastq ../f2_pf.fastq --runThreadN 3 --outFileNamePrefix ./star_output_f --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonicalUnannotated
###############################################################################################################################################################

I really appreciate your help on this.

Thanks,

Diya

Dedeepya Vaka

May 16, 2013, 1:52:12 AM

to rna-...@googlegroups.com

Hi Alex,

Thanks for the reply. So you mean that the output .SAM file should contain all the splicing information, which other programs will be able to find when they use it?

Diya

Shawn Driscoll

May 16, 2013, 3:51:33 AM

to rna-...@googlegroups.com

RNA-seq reads, since they are sampled from transcript sequences, will reveal the positions of introns in the form of split alignments when aligned to the genome with STAR or TopHat. These are reads that align partially on either side of an intron. Cufflinks uses these alignments, along with some fancy math, to TRY to estimate isoform-level expression. This of course tends to fail miserably in complex gene loci, and almost everywhere when coverage is low*.

In my experience the most information you have is that provided by the aligner, such as STAR, and your knowledge of the transcriptome. Some exons are alternatively spliced and others are not. From the alignments you can quantify how much an exon is included versus how much it is skipped and assign it an inclusion ratio. Then you can compare those ratios between samples to find evidence of possible differential splicing. I'd recommend finding an analysis based on this type of approach rather than one that claims to be able to quantify entire isoform expression levels, such as cufflinks/cuffdiff. To really do this you need a massive amount of read coverage. Since count values are shaky at low levels, you might require as many as 10 to 20 hits at a junction before testing for differential inclusion of an exon. In my own analysis this meant that I could only analyze about 8% of the potential differential splicing sites in the mouse with about 20M 100-base paired-end reads.
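The inclusion-ratio idea above can be illustrated with a small awk sketch. Everything here is an assumption made for illustration rather than the analysis used in this post: the function name `compare_inclusion` is hypothetical, the input is a hypothetical tab-separated table (exon ID, inclusion reads and skipping reads for sample A, then the same for sample B), the ratio is taken as inclusion / (inclusion + skipping), and the minimum-coverage filter is applied to the sum of inclusion and skipping reads in each sample.

```shell
#!/bin/sh
# Hypothetical input columns (tab-separated):
#   exon_id  inc_A  skip_A  inc_B  skip_B
# Output columns: exon_id  ratio_A  ratio_B  ratio_B - ratio_A
# $1 = counts table, $2 = minimum junction reads per sample (default 10)
compare_inclusion() {
    awk -F '\t' -v min="${2:-10}" '
        function ratio(inc, skip) { return inc / (inc + skip) }
        {
            totA = $2 + $3
            totB = $4 + $5
            # Skip exons with too little junction support in either sample.
            if (totA > 0 && totB > 0 && totA >= min && totB >= min) {
                a = ratio($2, $3)
                b = ratio($4, $5)
                printf "%s\t%.3f\t%.3f\t%.3f\n", $1, a, b, b - a
            }
        }
    ' "$1"
}
```

Calling `compare_inclusion counts.tsv 20` would apply the stricter end of the 10 to 20 read range. The last column is only a descriptive difference; deciding whether it is significant still requires replicates and a proper statistical test.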

*based on my own simulations
