STAR alignment management of multiple transcripts in GTF

George Wiggins

Oct 26, 2015, 10:56:03 PM

to rna-star

I have come across some points of confusion about how STAR (and potentially other aligners) deals with multiple transcript IDs. Because of the nature of the sequencing we have done (targeted RNA-seq), we only get sequence information across exon junctions for particular genes. Therefore the count information in the SJ.out.tab file should be all we need (correct me if you think otherwise). Column 7 = number of uniquely mapping reads crossing the junction, while column 8 = number of multi-mapping reads crossing the junction. My question is: if your gene has 10 transcripts, how does STAR deal with them? Let's assume all 10 transcripts contain exons 3 and 4. Would the reads over this junction map to all 10 transcripts and therefore end up as multi-mapping reads, or does STAR have some way of dealing with this?

Similarly, does the alignment (SAM file) manage to deal with multiple similar transcripts?

The second issue I am having, which you may also be able to help with: using the chromosomal coordinates provided by the SJ.out.tab file, how can I go about annotating each junction to include gene_name, exon_start (the first exon in which the read begins) and exon_stop (the last exon in which the read ends)? My current method has the same issue with multiple transcripts.

Alexander Dobin

Oct 28, 2015, 6:46:27 PM

to rna-star

Hi George,

I think there is some confusion in the definition of "multi-mapping" reads. STAR is normally used for mapping to the genome, so the unique (multi-) mappers are defined as those mapping to 1 (>1) loci in the genome. However, many quantification tools require reads mapped to the transcriptome, and in their definition multi-mappers are those reads that map to >1 transcript, even if these reads map uniquely to the genome. Note that the reverse could also be true: a read can map uniquely to the transcriptome but to multiple loci in the genome (one will be an annotated transcript, the others unannotated loci).

SJ.out.tab counts unique/multi-mapping reads with respect to the genome. So, just because a junction is shared between transcripts, the reads crossing it will not be considered multi-mappers.
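As an illustration of reading those genome-level counts, here is a minimal Python sketch that parses SJ.out.tab lines. The column layout follows the STAR manual (columns 2 and 3 are the first and last intron base, 1-based; column 4 is the strand code; columns 7 and 8 are the unique and multi-mapping counts). The names `Junction` and `parse_sj_out_tab` are made up for this example and are not part of STAR.

```python
from typing import Iterable, Iterator, NamedTuple


class Junction(NamedTuple):
    chrom: str
    intron_start: int  # column 2: first base of the intron, 1-based
    intron_end: int    # column 3: last base of the intron, 1-based
    strand: str        # column 4 decoded: "+", "-", or "." if undefined
    unique_reads: int  # column 7: uniquely mapping reads crossing the junction
    multi_reads: int   # column 8: multi-mapping reads crossing the junction


# STAR encodes strand as 0 (undefined), 1 (+), 2 (-).
_STRAND = {"0": ".", "1": "+", "2": "-"}


def parse_sj_out_tab(lines: Iterable[str]) -> Iterator[Junction]:
    """Yield one Junction per well-formed SJ.out.tab line (hypothetical helper)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or truncated lines
        yield Junction(
            chrom=fields[0],
            intron_start=int(fields[1]),
            intron_end=int(fields[2]),
            strand=_STRAND.get(fields[3], "."),
            unique_reads=int(fields[6]),
            multi_reads=int(fields[7]),
        )
```

With a real file this could be called as `parse_sj_out_tab(open("SJ.out.tab"))`; both counts are relative to genomic loci, not transcripts.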

To translate junctions from genome to transcript coordinates, you would need to create a database of junctions for each of the transcripts, and then compare each junction from SJ.out.tab to this database. A junction that is "unique" in the genome may therefore belong to a number of different transcripts.
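A rough Python sketch of such a junction database, assuming a standard GTF with `exon` features and `transcript_id` / `gene_name` (or `gene_id`) attributes. The function names are hypothetical, and exon numbers are derived here from coordinate order (reversed for minus-strand transcripts) rather than read from the GTF's own `exon_number` attribute, so treat the numbering as an assumption to check against your annotation.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def build_junction_db(gtf_lines):
    """Map (chrom, intron_start, intron_end) to a list of
    (gene_name, transcript_id, exon_start, exon_stop) tuples."""
    exons = defaultdict(list)  # (chrom, strand, gene, transcript) -> [(start, end)]
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        attrs = dict(_ATTR.findall(f[8]))
        gene = attrs.get("gene_name") or attrs.get("gene_id", "")
        key = (f[0], f[6], gene, attrs.get("transcript_id", ""))
        exons[key].append((int(f[3]), int(f[4])))

    db = defaultdict(list)
    for (chrom, strand, gene, tx), ex in exons.items():
        ex.sort()  # ascending genomic order
        n = len(ex)
        for i in range(n - 1):
            # Intron = bases between consecutive exons, matching the
            # SJ.out.tab convention (first and last intron base, 1-based).
            intron = (chrom, ex[i][1] + 1, ex[i + 1][0] - 1)
            if strand == "-":
                # Transcript order is the reverse of genomic order.
                exon_start, exon_stop = n - i - 1, n - i
            else:
                exon_start, exon_stop = i + 1, i + 2
            db[intron].append((gene, tx, exon_start, exon_stop))
    return dict(db)


def annotate_junction(db, chrom, intron_start, intron_end):
    """Return every transcript-level annotation for one genomic junction."""
    return db.get((chrom, intron_start, intron_end), [])
```

A junction shared by several transcripts simply returns several tuples, one per transcript, while its read counts in SJ.out.tab stay genome-level.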

You can also try --quantMode TranscriptomeSAM, which outputs a BAM file with alignments converted into transcriptomic coordinates, but then you would have to track which junctions are crossed by these alignments along the transcript sequence.

Cheers

Alex

George Wiggins

Oct 29, 2015, 4:08:47 PM

to rna-...@googlegroups.com

Hi Alex,

That is what I had arrived at after doing a couple of test alignments with different GTF files and noticing no effect on the SJ.out.tab file.
When generating a genome there is the option to pass in a GTF file (--sjdbGTFfile). I assumed this helps define the junctions; is there any benefit in doing this if I were to do 2-pass anyway? No new junctions were discovered when I aligned my reads to a genome generated with or without the --sjdbGTFfile option; however, this could be due to the nature of my sequencing results (targeted RNA-seq, with probes to detect junctions).

Regardless, your explanation has given me confidence in the results we have. We are building a method to annotate the junctions with exon information, which is where the issue of multiple transcripts arose. I wanted to be sure, before we pushed on, that the alignment was to the genome and not the transcriptome.

Alexander Dobin

Nov 2, 2015, 3:33:07 PM

to rna-star

Hi George,

Compared to the 2-pass approach, annotations (--sjdbGTFfile) help to detect extra junctions that are supported by very few reads.

I guess that in your targeted sequencing, all of the junctions of interest are crossed by a large number of reads, so the annotations do not affect the results.