gencode annotation and align to transcriptome

K. Zhang

Sep 29, 2016, 9:05:57 AM

to rna-star

Dear all,

The annotation file I used is the GENCODE v25 GTF file, in which the version number is appended to the gene and transcript IDs (e.g. ENST00000456328.2). But in my Aligned.toTranscriptome.out.bam file the transcript version number is missing:

@SQ SN:ENST00000456328 LN:1657

@SQ SN:ENST00000450305 LN:632

@SQ SN:ENST00000488147 LN:1351

@SQ SN:ENST00000619216 LN:68

@SQ SN:ENST00000473358 LN:712

@SQ SN:ENST00000469289 LN:535

.

K00114:365:HC77WBBXX:2:1102:11738:45432 419 ENST00000565948 2339 3 24M = 2448 197 AAACCTCAATAGTGCCCGCCGCAT JAAFJJJ<7AJ<7F7FA-AF-A-7 NH:i:2 HI:i:1

K00114:365:HC77WBBXX:2:1102:11738:45432 339 ENST00000565948 2448 3 88M = 2339 -197 CGGAGAAATGTCAACTGGGAACAGGTCATTCAGCAAGTAACCAAGAAAAAGCAAGAGCTGGGCAAAGGCTTACCCAGGTTTGGCATAG AFJJFFFJFJFJFFAJJFJFJJFFJFF-A<FJJJJF<FJJFF<J<F<F7JJJJAJJJJJJJJJJJJJJJFJJJJJFJJJFJJJJJJFF NH:i:2 HI:i:1

K00114:365:HC77WBBXX:2:1102:11738:45432 163 ENST00000263805 2574 3 24M = 2683 197 AAACCTCAATAGTGCCCGCCGCAT JAAFJJJ<7AJ<7F7FA-AF-A-7 NH:i:2 HI:i:2

Does anyone know why this happens? I was using eXpress for quantification, but it did not work because the target names in the reference file (e.g. ENST00000445290.1) do not match the names in the alignment file (e.g. ENST00000445290).
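A minimal sketch for confirming this kind of mismatch, in Python. It assumes the BAM header has been dumped to text (for example with `samtools view -H`) and that the reference target names are the first whitespace-delimited token of each FASTA header line. The function names are hypothetical, and the version pattern assumes purely numeric suffixes such as ".2".

```python
import re

# Trailing Ensembl-style version suffix such as ".2" (assumption: numeric only,
# as in the ids quoted in this thread).
VERSION_SUFFIX = re.compile(r"\.\d+$")


def sq_names(header_lines):
    """Collect the SN: values from the @SQ lines of a SAM header."""
    names = []
    for line in header_lines:
        fields = line.split()
        if fields and fields[0] == "@SQ":
            for field in fields[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def fasta_names(fasta_lines):
    """Collect the first whitespace-delimited token of each FASTA header."""
    names = []
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            names.append(line[1:].split()[0])
    return names


def version_only_mismatches(bam_names, ref_names):
    """Return (bam_name, ref_name) pairs that differ only by a version suffix."""
    ref_set = set(ref_names)
    by_base = {VERSION_SUFFIX.sub("", name): name for name in ref_names}
    return [
        (name, by_base[name])
        for name in bam_names
        if name not in ref_set and name in by_base
    ]
```

If every pair returned by `version_only_mismatches` differs only in the suffix, the two files describe the same transcripts and only the naming has to be reconciled.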

Thank you very much in advance!

Kaiyang

Alexander Dobin

Sep 30, 2016, 2:54:25 PM

to rna-star

Hi Kaiyang,

I do not see this problem in my test runs. Please send me the Log.out file from the genome generation run, and a few lines of the GTF file.

Cheers

Alex

Alexander Dobin

Sep 30, 2016, 3:05:34 PM

to rna-star

It seems like this problem is related to the one described in this post:

https://groups.google.com/d/msg/rna-star/nghNt7h_WTo/2tOh8-UjAwAJ

It seems like GENCODE formatting has changed in the recent release, and now contains the "transcript_version" field separate from "transcript_id".

The quick fix for this would be to add the transcript_version to transcript_id in the gtf file before generating the genome index.
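A minimal sketch of that quick fix in Python, assuming a GTF in which `transcript_version` is a separate attribute next to an unversioned `transcript_id`. The function names and file paths are hypothetical. Lines that lack either attribute, or whose id already ends in the version, are returned unchanged.

```python
import re

# \b keeps these from matching inside longer attribute names.
TRANSCRIPT_ID = re.compile(r'\btranscript_id "([^"]+)"')
TRANSCRIPT_VERSION = re.compile(r'\btranscript_version "?(\d+)"?')


def add_version_to_transcript_id(gtf_line):
    """Return the GTF line with transcript_version appended to transcript_id."""
    if gtf_line.startswith("#"):
        return gtf_line  # header/comment line
    id_match = TRANSCRIPT_ID.search(gtf_line)
    version_match = TRANSCRIPT_VERSION.search(gtf_line)
    if not id_match or not version_match:
        return gtf_line  # e.g. gene lines, or no separate version attribute
    transcript_id = id_match.group(1)
    version = version_match.group(1)
    if transcript_id.endswith("." + version):
        return gtf_line  # id already carries this version
    start, end = id_match.span(1)
    return gtf_line[:start] + transcript_id + "." + version + gtf_line[end:]


def rewrite_gtf(src_path, dst_path):
    """Write a copy of the GTF with versioned transcript ids."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(add_version_to_transcript_id(line))
```

The rewritten file would then be the one passed to genome generation. Any transcript FASTA used downstream would need the same versioned names.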

Cheers

Alex

K. Zhang

Oct 1, 2016, 5:07:05 PM

to rna-star

Hi Alex,

The transcript_version is attached to transcript_id in the gtf file I was using:

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

What is strange is that the version is attached to the transcript ID in the transcriptInfo.tab file in the genome folder, yet it is missing from Aligned.toTranscriptome.out.bam.

A few lines of the transcriptInfo.tab:

198093

ENST00000456328.2 11868 14408 14408 1 3 0

ENST00000450305.2 12009 13669 14408 1 6 3

ENST00000488147.1 14403 29569 14408 2 11 9

ENST00000619216.1 17368 17435 29569 2 1 20

ENST00000473358.1 29553 31096 29569 1 3 21

ENST00000469289.1 30266 31108 31096 1 2 24
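The comparison described here can be sketched in Python. This assumes the layout seen in the excerpt (a first line holding the transcript count, then one row per transcript with the id in the first column) and that the @SQ names in the BAM header follow the same order as transcriptInfo.tab, which the two excerpts in this thread suggest but do not prove. The function names are hypothetical.

```python
def read_transcript_info_ids(lines):
    """Return (declared_count, ids) from transcriptInfo.tab-style lines."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return 0, []
    declared_count = int(rows[0][0])   # first line: number of transcripts
    ids = [row[0] for row in rows[1:]]  # first column: transcript id
    return declared_count, ids


def positional_differences(info_ids, header_names):
    """Return (index, info_id, header_name) wherever the two lists disagree.

    Assumes both lists are in the same order; extra entries in the longer
    list are ignored.
    """
    return [
        (index, info_id, header_name)
        for index, (info_id, header_name) in enumerate(zip(info_ids, header_names))
        if info_id != header_name
    ]
```

Feeding it the ids from transcriptInfo.tab and the SN: values from the BAM header shows at which positions the version suffix was lost.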

The Log.out file from the genome generation is attached and I was using STAR_2.5.1b_modified. Thanks a lot!

Best regards,

Kaiyang

Log.out

Alexander Dobin

Oct 4, 2016, 1:23:53 PM

to rna-star

Hi Kaiyang,

This is very strange: STAR does not change the "transcript_id" from the GTF, and the IDs for transcriptInfo.tab are taken directly from transcriptInfo.tab.

Please send me the Log.out file from the mapping stage.

Cheers

Alex
