Mapping to transcriptome with Bowtie, instead of genome mapping with STAR

4,116 views
Skip to first unread message

Daniel Fernandez

unread,
Dec 3, 2013, 2:13:27 PM12/3/13
to rna-...@googlegroups.com
Dear Alex, 

Please don't take me wrong here, I am super pleased with STAR alignments but some good software for Differential expression and downstream analysis, such as BitSeq, RSEM and MMSEQ use Bowtie to map RNA-Seq data directly to the transcriptome - i.e., they build a fasta file with all the transcripts in a GTF table, then index it and then map it with bowtie1.

I was wondering about your experience regarding this approach and how would it compare to an STAR alignment. My intuition tells me that there's something odd with this and that when using long (100 bp or more) paired-end data this could be a major issue, but it's just my intuition no formal proof.  I was wondering what are your thoughts regarding this bowtie1 transcriptome mapping approach?

On a second, but closely related note, I was wondering how one can make STAR alignments work with such tools and came to my mind that converting the STAR alignments to transcriptomic coordinates could be a feasible solution for most of this softwares.  Have you had any experience on converting coordinates efficiently?  I tried some code for doing this in C but it's completely inefficient due to two major bottlenecks: intersecting each read with the transcript coordinates and reading the read sequence from the bam files.

Let me know if you have some good suggestions regarding this issues.

Daniel F.


Daniel Fernandez

unread,
Dec 3, 2013, 8:15:18 PM12/3/13
to rna-...@googlegroups.com
Alex, I have done more research about it and seems most expression quantification and DE tools such as Express, RSEM, bitseq, MMSEQ, etc. all use bam files in transcriptomic coordinates making it hard to use STAR for quantifying expression with any other tool... do you think there may be a way to report STAR in transcriptomic coordinates?  just wanted to add this to the discussion, thanks.

Alexander Dobin

unread,
Dec 5, 2013, 11:55:24 AM12/5/13
to rna-...@googlegroups.com
Hi Daniel,

it is possible to use STAR to map reads directly to the transcriptome - basically, you build a genome with a fasta file of all transcripts sequences. At the mapping stage you would need to prohibit splicing, since the transcript sequences are already spliced. Some programs have their idiosyncratic requirements, like not allowing indels or soft-clipping - please see the discussion of how to deal with these requirements here.

Unlike the software that you mentioned, HTseq and Cufflinks (which are probably the most popular quantification software) take the genomic alignments of RNA-seq.

In my opinion, there are dangers in mapping the reads only to the annotated transcript sequences, since a large number of reads (~25%) usually map outside of spliced transcripts (mostly in the introns). These reads will be "forced" to map to the transcriptome possibly creating false positives. I do not know the extent of this problem, and I think it's an interesting question for someone to explore. I believe the original Cufflinks paper emphasized the importance of accounting for unannotated transcripts for accurate quantification, however, I do not think the software you mentioned before (including eXrpess) addresses this question in any way.

I like the idea of making STAR output the alignments in the transcriptomic coordinates even if it maps the reads to the whole genome. I am planning to work on that next year (and other simple quantification features), after I finish with BAM sorting and some simple features.

Cheers
Alex

Daniel Fernandez

unread,
Dec 5, 2013, 2:08:16 PM12/5/13
to rna-...@googlegroups.com
Hi Alex, thanks a lot for such detailed response.  I basically agree with all of your statements.  With respect to quantifying expression I am more comfortable using transcript-based models than count-based models since they model the data at the biological and data generating reality (a transcript).  That's the reason I mentioned the methods above - however, you are right, cufflinks does use transcript-level approach as well, not sure about HTSEQ (have the feeling it just lumps up reads at the "gene" level, a non-existing entity).

I was planning to do a fair model comparison with the STAR alignments that's why I thought about using all models.  I will see, if I have time I may write a genome to transcriptome converting code - note, if you want me to share it with you, or have some suggestion on how to make it fast and efficient I'd also be happy to hear more suggestions.

Thanks!

Alexander Czerny

unread,
Dec 9, 2013, 7:51:02 AM12/9/13
to rna-...@googlegroups.com
Hi Daniel,
 u actually can map your data with STAR against a selfmade transcriptome. Filter the softclipped parts of reads (mentioned in the forum how to do), And use the rsem --sam option for counting until Alex has come up with some more simple solution ;)
If you are interested, i can give u a more detailed explanation how to do it.

greetings, Alex.

Daniel Fernandez

unread,
Apr 21, 2014, 4:24:58 PM4/21/14
to rna-...@googlegroups.com
Hi Alexander Czerny,

I am actually interested in mapping the STAR data against a selfmade transcriptome, I will take your offer :-)... would you mind giving me a more detailed explanation on how to do this?

Thanks a lot!

Alexander Czerny

unread,
Apr 30, 2014, 3:54:08 AM4/30/14
to
Hi Daniel,

here the link again were the main points where stated.
The important changes i marked bold. You are prohibiting Indels and Introns as well as puting out upto 100 multimaped alignments isntead of just 10 for RSEM EM-calculations.
basicly u are fitting STAR to behave like bwa/bowtie2:
 ionice -n 7 STAR --genomeDir $trans --readFilesIn $fastqgz --runThreadN $Nthreads  --readFilesCommand zcat --alignEndsType EndToEnd --genomeLoad NoSharedMemory --outFilterMultimapNmax 100 --outSAMunmapped Within --alignIntronMax 1 --alignIntronMin 2 --scoreDelOpen -10000  --scoreInsOpen -10000 --outFileNamePrefix ${output}${prefix}_T_${mmrate}_SE. --outFilterMismatchNoverLmax $mmrate

and afterwards you are calling RSEM on that. You can create the transcriptome also with rsem before (here) and index it then with STAR (note u have to set --genomeChrBinNbits  to 14 else it would take to much memory and crash).

Hope this helps, Alex.
Reply all
Reply to author
Forward
0 new messages