Maybe another interesting observation: I just ran a 3 million read simulation and quantified it with Sailfish and with RSEM. For the RSEM pipeline I align to the transcriptome with STAR. Sailfish took about 9 minutes; the RSEM-STAR pipeline also took about 9 minutes and was much more accurate. Additionally, if I convert the RSEM-assigned alignments to genomic coordinates and then count hits against a GTF, the result is also very accurate: essentially identical to the control counts and the RSEM estimated counts. The comparison was done with gene-locus-level count summaries (not isoform-level counts).
The fact is, with STAR around we don't actually need to speed up or avoid the mapping stage.
On Thursday, May 15, 2014 11:55:13 PM UTC-7, Shawn Driscoll wrote:
This is interesting. Are you able to share some of this test data? We actually have two new modes in the development branch of Sailfish that we're testing. One that assigns kmers in groups to increase quantification accuracy with longer reads and a second that actually accepts alignments. I'd be very interested in seeing if either or both of these methods close the accuracy gap you're seeing on your data. The goal is for Sailfish to be both fast and accurate, and I'm interested in anything that can help us achieve that goal.
--Rob
Are you able to share this test data? As I mentioned above in my response to Alex, I'd like to run some of these tests with our current Sailfish improvements to see if we can close any accuracy gap. Also, our read-based inference should still be significantly faster than RSEM's.
--Rob
Cheers,
Rob
You will need to re-generate the STAR genome to run the transcriptome transformation. At the mapping stage, you need to add --quantMode TranscriptomeSAM.
Note that this transformation happens simultaneously with mapping. The transcriptomic alignments are streamed into the AlignedToTranscriptome.out.bam file, in addition to the normal alignments in Aligned.out.sam. At the moment the transcriptomic alignments are geared towards RSEM: indels and soft-clipping are not allowed.
You can run STAR and RSEM at the same time through a fifo file like this:
mkfifo AlignedToTranscriptome.out.bam
STAR --genomeDir /path/to/genome/ --readFilesIn Read1.gz Read2.gz --outSAMattributes NH HI --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --quantMode TranscriptomeSAM --runThreadN 12 --readFilesCommand zcat &
rsem-calculate-expression -p 12 --bam --paired-end --no-bam-output --forward-prob 0 --estimate-rspd AlignedToTranscriptome.out.bam /path/to/RSEM/reference RSEM >& Log.rsem
This is still quite experimental and I need to do more thorough testing.
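For anyone who has not used a fifo this way, the pattern behind the commands above can be tried in isolation. This is only a minimal sketch: the producer and consumer here are stand-ins (printf and wc), not STAR or RSEM, and the function name and fifo file name are made up for illustration.

```shell
# Minimal named-pipe (fifo) demo: a background "producer" streams into the
# pipe while a foreground "consumer" reads from it, so the stream is never
# stored as a regular file on disk.
# NOTE: fifo_demo and stream.fifo are hypothetical names; the producer and
# consumer are placeholders, not STAR or RSEM.
fifo_demo() {
    local workdir fifo producer_pid
    workdir=$(mktemp -d)
    fifo="$workdir/stream.fifo"
    mkfifo "$fifo"                 # the pipe must exist before either side opens it

    # Producer: runs in the background and blocks until a reader opens the fifo.
    printf 'read1\nread2\nread3\n' > "$fifo" &
    producer_pid=$!

    # Consumer: reads the stream as it arrives (here it only counts lines).
    wc -l < "$fifo" | tr -d ' '

    wait "$producer_pid"           # make sure the producer finished
    rm -rf "$workdir"
}

fifo_demo
```

In the real pipeline the producer role is played by STAR writing AlignedToTranscriptome.out.bam and the consumer role by rsem-calculate-expression reading it, exactly as in the commands above.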
Cheers
Alex
--
You received this message because you are subscribed to a topic in the Google Groups "rna-star" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rna-star/ASsO340hlug/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rna-star+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/rna-star.
Perfect, thanks Vladimir!
my problem with the recent builds is just that they segfault... I would be so stoked to use these options... but alas I have regressed to the release :-(
run                          # of non-zero read counts   average read count   stdev read count
test_p0.5_R1.genes.results   1938                        8.97149              44.0204
test_p0.5_R2.genes.results   11552                       58.0327              311.64
test_p0_R1.genes.results     3                           4.66667              5.18545
test_p0_R2.genes.results     2                           1.5                  0.5
test_p1_R1.genes.results     1938                        8.97033              44.0334
test_p1_R2.genes.results     11565                       57.9754              311.48
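The command used to produce the summary above was not posted, so the following is only a sketch of one way to get the same three columns from an RSEM *.genes.results file. It assumes the standard tab-separated layout with a header line and expected_count in column 5, and it assumes the mean and standard deviation are taken over non-zero counts only, using the population (divide by n) form. The function name summarize_counts is made up for illustration.

```shell
# Summarise the expected_count column of an RSEM *.genes.results file.
# Assumptions (not confirmed by the thread): tab-separated input with one
# header line, expected_count in column 5, statistics over non-zero counts
# only, population standard deviation.
# Prints: <number of non-zero counts> <mean> <stdev>
summarize_counts() {
    awk -F'\t' '
        NR > 1 && $5 > 0 {            # skip the header and zero-count genes
            n++
            sum   += $5
            sumsq += $5 * $5
        }
        END {
            if (n == 0) { print 0, 0, 0; exit }
            mean = sum / n
            var  = sumsq / n - mean * mean
            if (var < 0) var = 0      # guard against rounding error
            print n, mean, sqrt(var)
        }' "$1"
}

# Usage: summarize_counts test_p0.5_R1.genes.results
```

As a sanity check on the population-stdev assumption: a run with two non-zero counts averaging 1.5 and a stdev of 0.5 is what two counts of 1 and 2 would give with the divide-by-n form.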
Hi Rob,
Sorry, I meant to reply to this thread earlier. The separation between the STAR+RSEM counts and the Sailfish counts in this case was not much. I revised my post above. I think in this case the speed difference was also small because I expressed almost all of the genes in the transcriptome, which I assume makes a lot more work for the EM algorithm. I can send you data - it's only simulated reads...maybe similar to what you generate with the FLUX simulator? I always assumed those simulators would generate reads AND control counts for the transcripts they sampled from.
On Tuesday, May 20, 2014 4:47:24 PM UTC-7, Rob Patro wrote:
Hi Shawn,
Are you able to share this test data? As I mentioned above in my response to Alex, I'd like to run some of these tests with our current Sailfish improvements to see if we can close any accuracy gap. Also, our read-based inference should still be significantly faster than RSEM's.
--Rob