STAR aligner mapping to transcriptome

Alexander Czerny

May 27, 2013, 4:19:44 AM

to rna-...@googlegroups.com

Hi people,

I'm new to the forum, but I have been reading it for a long time and it has helped me a lot so far. I have 3 questions that I hope you can help me answer:

#1 When generating the genome index for human, I want to include annotations.
What I have so far:
- single-end reads, 50 bp
- the Ensembl whole_genome.fa
- gencode.v16.annotation.gtf for the annotation, which I found recommended in this forum

command:
STAR --runMode genomeGenerate --genomeDir "path/Referenzgenom_hg19_STAR/Ensembl" --genomeFastaFiles $hg19_ensembl \
--runThreadN 50 --sjdbFileChrStartEnd $gtf_human --sjdbOverhang 49 --genomeChrBinNbits 16

But when I map my data against it, I get no annotated splice junctions, and no splice junction annotation file was created in the genome directory. Is the --sjdbOverhang value correct?

#2 How exactly can I map my data against the transcriptome first and then map the leftover reads against the genome? How do you do this?

So far I have an indexed transcriptome from Ensembl and the genome from above.

Which leads me to question #3:

I mapped against my transcriptome, but Log.final.out still reports splice junctions. I think this shouldn't happen when mapping against a transcriptome, since all exons are already joined together. How should I interpret this?

I hope you can help me. Thanks in advance.

Alex.

Alexander Dobin

May 29, 2013, 12:03:11 PM

to rna-...@googlegroups.com

Hi Alex,

1. To include the annotations, you need to use --sjdbGTFfile <annotation.gtf> at the genome generation step. Also, please check that the chromosome names are the same in the gencode.gtf and ENSEMBL whole_genome.fa files. Gencode uses "chr" in chromosome names, while ENSEMBL does not, so you have to be careful.
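
The naming point above can be checked before building the index. The following is a minimal sketch, not part of the original reply: the function names are hypothetical, and it assumes an uncompressed FASTA and GTF. It prints the GTF sequence names that do not occur in the FASTA, so a mismatch such as "chr1" versus "1" shows up immediately.

```shell
# List sequence names in a FASTA file (first word after ">").
fasta_chroms() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort -u
}

# List sequence names from column 1 of a GTF file, skipping "#" comment lines.
gtf_chroms() {
    grep -v '^#' "$1" | cut -f1 | sort -u
}

# Print GTF sequence names that are absent from the FASTA.
# Empty output means every name used in the GTF exists in the FASTA.
# Usage: gtf_names_missing_from_fasta genome.fa annotation.gtf
gtf_names_missing_from_fasta() {
    fa_tmp=$(mktemp)
    gtf_tmp=$(mktemp)
    fasta_chroms "$1" > "$fa_tmp"
    gtf_chroms "$2" > "$gtf_tmp"
    # comm -13 keeps only the lines unique to the second (GTF) list.
    comm -13 "$fa_tmp" "$gtf_tmp"
    rm -f "$fa_tmp" "$gtf_tmp"
}
```

If the output is empty, the names agree, and genome generation with annotations would then be along the lines of `STAR --runMode genomeGenerate --genomeDir <genomeDir> --genomeFastaFiles <genome.fa> --sjdbGTFfile <annotation.gtf> --sjdbOverhang 49`. The value 49 is the read length minus 1 for the 50 bp reads in the question. Note that the command in the question passed the GTF to --sjdbFileChrStartEnd, which expects a tab-separated list of junction coordinates rather than a GTF.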

2. Once your genome generation with the --sjdb* options works, STAR will map to the transcriptome and the genome simultaneously and select the best alignment. You do not need to "map to transcriptome first, then map to the genome".

3. If you really want to map to the transcriptome, you need to generate a genome from the sequences of all annotated transcripts, so that each transcript is a separate "chromosome". Note that alignments in the .sam file will then be given in "transcriptome" coordinates. The spliced reads within the transcriptome, mostly non-canonical, will correspond to unannotated (novel) junctions.
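
One practical detail when building such a transcript-as-chromosome index, sketched here as an illustration that is not from the reply above: a transcriptome consists of many short references, and the STAR manual recommends scaling --genomeChrBinNbits as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))) for genomes with a large number of references. The helper names below are hypothetical, and rounding to the nearest integer is an assumption.

```shell
# Print "<total bases> <number of sequences>" for an uncompressed FASTA file.
fasta_stats() {
    awk '/^>/ { n++; next } { len += length($0) } END { printf "%.0f %d\n", len, n }' "$1"
}

# Suggest a --genomeChrBinNbits value.
# Arguments: genome length, number of references, read length.
suggest_chr_bin_nbits() {
    awk -v len="$1" -v n="$2" -v rl="$3" 'BEGIN {
        x = len / n                         # mean reference length
        if (x < rl) x = rl                  # max(GenomeLength/NumberOfReferences, ReadLength)
        bits = int(log(x) / log(2) + 0.5)   # log2, rounded to nearest (assumption)
        if (bits > 18) bits = 18            # cap at 18
        print bits
    }'
}
```

For example, `suggest_chr_bin_nbits $(fasta_stats transcripts.fa) 50` (with transcripts.fa as a placeholder file name) prints a value that could be passed to `STAR --runMode genomeGenerate --genomeDir <genomeDir> --genomeFastaFiles transcripts.fa --genomeChrBinNbits <value>`, without any --sjdb* options.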

Cheers

Alex

Alexander Czerny

Jun 10, 2013, 6:41:57 AM

to rna-...@googlegroups.com

Hi Alex,

Thanks for your help, it works fine now.

greetings, Alex.

broder...@googlemail.com

May 16, 2016, 4:33:28 AM

to rna-star

Hi Alex,

> 3. If you really want to map to the transcriptome, you need to generate a genome from the sequences of all annotated transcripts, so that each transcript is a separate "chromosome". Note that alignments in the .sam file will then be given in "transcriptome" coordinates. The spliced reads within the transcriptome, mostly non-canonical, will correspond to unannotated (novel) junctions.

I would like to map to the transcriptome as you describe under point 3 above. I understand that it is generally better to map against the genome so that reads are not forced onto the transcripts. However, in this case we have few reads and lose some of them to the genome space. So we want to increase the sensitivity, knowing that this decreases the specificity at the same time.

Well, I already generated the genome index, with each transcript being a separate "chromosome". Now I am struggling with getting a GTF file that has the correct coordinates to correspond with the SAM file. As a starting point I have the following files:

ftp://ftp.ensembl.org/pub/release-84/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz

ftp://ftp.ensembl.org/pub/release-84/gtf/mus_musculus/Mus_musculus.GRCm38.84.gtf.gz

There seems to be no GTF file from ENSEMBL that is specifically for the transcriptome. So I guess I would have to use the genome GTF as a starting point.

Do you have an idea how I can go about this most effectively?

Cheers

Broder

Alexander Dobin

May 16, 2016, 5:22:45 PM

to rna-star

Hi Broder,

When mapping to the transcriptome, you do not need the GTF file, since all the information about the transcripts is already contained in the sequences of the transcriptome file.
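
If gene-level information is still wanted later, it can often be recovered from the FASTA headers rather than from a GTF. The sketch below is an illustration under an assumption: that each header in the cDNA file starts with the transcript ID and carries a "gene:" field, as Ensembl cDNA headers usually do. The function name is hypothetical, and the input must be uncompressed.

```shell
# Print "transcript_id<TAB>gene_id" for every FASTA header that has a "gene:" field.
# Assumes the transcript ID is the first word after ">" (assumption about the header layout).
# Headers without a "gene:" field are skipped.
transcript_to_gene() {
    awk '/^>/ {
        tx = substr($1, 2)                    # drop the leading ">"
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^gene:/) {
                print tx "\t" substr($i, 6)   # text after "gene:"
                break
            }
        }
    }' "$1"
}
```

The resulting two-column table can be used to sum per-transcript counts into per-gene counts.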

Cheers

Alex

A R

Mar 31, 2021, 1:33:10 PM

to rna-star

Hello Alex,

I have a similar problem. I am trying to map shrimp mRNA transcripts to a (rather small) transcriptome FASTA file without using a GTF/GFF file. However, I get the following error when using the flag "--quantMode TranscriptomeSAM".

----------

Transcriptome.cpp:14:Transcriptome: exiting because of *INPUT FILE* error: could not open input file ./STAR_genome_index//geneInfo.tab
Solution: check that the file exists and you have read permission for this file
SOLUTION: utilize --sjdbGTFfile /path/to/annotations.gtf option at the genome generation step or mapping step

-----------

I do have all permissions set. It works when I remove the --quantMode TranscriptomeSAM flag, but I really wanted to get counts for our project. Can you let me know if there is a way to use the flag without a GTF file?

Many thanks,

Anna Rawles

Alexander Dobin

Mar 31, 2021, 5:37:16 PM

to rna-star

Hi Anna,

If you are mapping to the transcriptome, then Aligned.out.bam already contains alignments to the transcripts, so you do not need the --quantMode TranscriptomeSAM option (which you cannot use without a GTF).
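
Since each reference in such an index is a transcript, a first-pass count per transcript can be read directly from the alignments. The following is a minimal sketch, not a replacement for a proper quantification tool: it counts only primary alignment records of mapped reads, so a multi-mapping read is credited to a single transcript, which is a simplification. The function name is hypothetical, and it reads SAM text on standard input.

```shell
# Count primary alignment records per reference name from SAM text on stdin.
# Skips header lines, unmapped reads (flag bit 4), secondary (256)
# and supplementary (2048) alignments. Output: "reference<TAB>count", sorted.
count_per_transcript() {
    awk -F '\t' '
        /^@/ { next }
        {
            flag = $2
            if (int(flag / 4) % 2 == 1) next      # read unmapped
            if (int(flag / 256) % 2 == 1) next    # secondary alignment
            if (int(flag / 2048) % 2 == 1) next   # supplementary alignment
            count[$3]++
        }
        END { for (t in count) print t "\t" count[t] }
    ' | sort
}
```

With samtools installed, this could be used as `samtools view -h Aligned.out.bam | count_per_transcript`. For paired-end data each mate is a separate record, so the numbers would be record counts rather than fragment counts.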

Cheers

Alex

A R

Apr 13, 2021, 10:20:37 AM

to rna-star

Hi Alex,

I apologize for my late response - thank you for the informative reply!

Sincerely,

Anna
