gencode annotation and align to transcriptome

K. Zhang

Sep 29, 2016, 9:05:57 AM
to rna-star
Dear all,

The annotation file I used is the GENCODE v25 GTF file, in which the version number is appended to the gene and transcript IDs (e.g. ENST00000456328.2). But in my Aligned.toTranscriptome.out.bam file the transcript version number is missing:

@SQ     SN:ENST00000456328      LN:1657
@SQ     SN:ENST00000450305      LN:632
@SQ     SN:ENST00000488147      LN:1351
@SQ     SN:ENST00000619216      LN:68
@SQ     SN:ENST00000473358      LN:712
@SQ     SN:ENST00000469289      LN:535
                      .
                      .
                      .
K00114:365:HC77WBBXX:2:1102:11738:45432 419     ENST00000565948 2339    3       24M     =       2448    197     AAACCTCAATAGTGCCCGCCGCAT        JAAFJJJ<7AJ<7F7FA-AF-A-7        NH:i:2  HI:i:1
K00114:365:HC77WBBXX:2:1102:11738:45432 339     ENST00000565948 2448    3       88M     =       2339    -197    CGGAGAAATGTCAACTGGGAACAGGTCATTCAGCAAGTAACCAAGAAAAAGCAAGAGCTGGGCAAAGGCTTACCCAGGTTTGGCATAG        AFJJFFFJFJFJFFAJJFJFJJFFJFF-A<FJJJJF<FJJFF<J<F<F7JJJJAJJJJJJJJJJJJJJJFJJJJJFJJJFJJJJJJFF        NH:i:2  HI:i:1
K00114:365:HC77WBBXX:2:1102:11738:45432 163     ENST00000263805 2574    3       24M     =       2683    197     AAACCTCAATAGTGCCCGCCGCAT        JAAFJJJ<7AJ<7F7FA-AF-A-7        NH:i:2  HI:i:2

Does anyone know why this happens? I was using eXpress for quantification, but it did not work because the target names in the reference file (e.g. ENST00000445290.1) do not match those in the alignment file (e.g. ENST00000445290).
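
A mismatch like this can be confirmed before running the quantifier. The following is a minimal Python sketch, not part of STAR or eXpress: it assumes the version is always a trailing ".<digits>" on the ID, and the function names (sq_names, version_only_mismatches) are hypothetical helpers introduced only for illustration.

```python
import re

# Assumption: Ensembl/GENCODE-style IDs carry the version as a trailing ".<digits>".
VERSION_SUFFIX = re.compile(r"\.\d+$")


def sq_names(sam_header_lines):
    """Collect the SN: values from the @SQ lines of a SAM header."""
    names = []
    for line in sam_header_lines:
        if not line.startswith("@SQ"):
            continue
        # split() handles both tab- and space-separated fields.
        for field in line.split()[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def version_only_mismatches(reference_ids, alignment_ids):
    """Return (reference_id, alignment_id) pairs that differ only by the version suffix."""
    aligned = set(alignment_ids)
    pairs = []
    for ref_id in reference_ids:
        if ref_id in aligned:
            continue  # exact match, nothing to report
        bare = VERSION_SUFFIX.sub("", ref_id)
        if bare != ref_id and bare in aligned:
            pairs.append((ref_id, bare))
    return pairs
```

The header lines could come from `samtools view -H Aligned.toTranscriptome.out.bam`, and the reference IDs from the sequence names of the FASTA file given to the quantifier.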

Thank you very much in advance!
Kaiyang

Alexander Dobin

Sep 30, 2016, 2:54:25 PM
to rna-star
Hi Kaiyang,

I do not see this problem in my test runs. Please send me the Log.out file from the genome generation step, and a few lines of the GTF file.

Cheers
Alex

Alexander Dobin

Sep 30, 2016, 3:05:34 PM
to rna-star
It seems like this problem is related to the one described in this post:

It seems like GENCODE formatting has changed in the recent release, and now contains the "transcript_version" field separate from "transcript_id".
The quick fix for this would be to add the transcript_version to transcript_id in the gtf file before generating the genome index.
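
A minimal Python sketch of that quick fix follows. It assumes the GTF carries the two attributes in the form transcript_id "ENST..."; transcript_version "2"; and the function name merge_transcript_version is a hypothetical helper, not something provided by STAR. Lines that lack a transcript_version attribute, or whose ID already ends in that version, are returned unchanged.

```python
import re

# Assumed attribute layout: transcript_id "ENST..."; transcript_version "2";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
TRANSCRIPT_VERSION = re.compile(r'transcript_version "?(\d+)"?')


def merge_transcript_version(gtf_line):
    """Append transcript_version to transcript_id on a single GTF line."""
    if gtf_line.startswith("#"):
        return gtf_line  # header/comment lines pass through untouched
    id_match = TRANSCRIPT_ID.search(gtf_line)
    version_match = TRANSCRIPT_VERSION.search(gtf_line)
    if not id_match or not version_match:
        return gtf_line  # e.g. gene lines, or a GTF without the separate field
    transcript_id = id_match.group(1)
    version = version_match.group(1)
    if transcript_id.endswith("." + version):
        return gtf_line  # version is already part of the ID
    merged = 'transcript_id "{}.{}"'.format(transcript_id, version)
    return gtf_line[:id_match.start()] + merged + gtf_line[id_match.end():]
```

Applied line by line to the annotation and written to a new file, the result would then be the GTF passed to the genome generation step.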

Cheers
Alex

K. Zhang

Oct 1, 2016, 5:07:05 PM
to rna-star
Hi Alex,

The transcript_version is attached to transcript_id in the gtf file I was using: 

chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

This is strange: the version is attached to the transcript ID in the transcriptInfo.tab file in the genome folder, but it is missing from Aligned.toTranscriptome.out.bam.

A few lines of the transcriptInfo.tab: 
198093
ENST00000456328.2       11868   14408   14408   1       3       0
ENST00000450305.2       12009   13669   14408   1       6       3
ENST00000488147.1       14403   29569   14408   2       11      9
ENST00000619216.1       17368   17435   29569   2       1       20
ENST00000473358.1       29553   31096   29569   1       3       21
ENST00000469289.1       30266   31108   31096   1       2       24
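
The comparison between the index and the BAM header can be made systematic with a short Python sketch. It assumes only what the excerpt above shows: the first line of transcriptInfo.tab is a transcript count and each following line starts with a transcript ID. The function names are hypothetical helpers introduced for illustration.

```python
def transcript_info_ids(lines):
    """Return (declared_count, ids) from transcriptInfo.tab-style lines."""
    rows = [line.split() for line in lines if line.strip()]
    declared_count = int(rows[0][0])  # first line holds the transcript count
    ids = [row[0] for row in rows[1:]]  # first column of every other line
    return declared_count, ids


def strip_version(transcript_id):
    """Drop a trailing ".<digits>" version, if there is one."""
    base, dot, suffix = transcript_id.rpartition(".")
    return base if dot and suffix.isdigit() else transcript_id


def compare_with_bam(index_ids, bam_names):
    """Label each index ID as 'same', 'version_dropped', or 'missing' in the BAM header."""
    names = set(bam_names)
    result = {}
    for tid in index_ids:
        if tid in names:
            result[tid] = "same"
        elif strip_version(tid) in names:
            result[tid] = "version_dropped"
        else:
            result[tid] = "missing"
    return result
```

If every ID is labeled version_dropped, the version is being lost somewhere between the genome index and the transcriptome BAM rather than in the GTF.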

The Log.out file from the genome generation is attached and I was using STAR_2.5.1b_modified. Thanks a lot!

Best regards,
Kaiyang
Log.out

Alexander Dobin

Oct 4, 2016, 1:23:53 PM
to rna-star
Hi Kaiyang,

This is very strange: STAR does not change the "transcript_id" from the GTF, and the IDs for Aligned.toTranscriptome.out.bam are taken directly from transcriptInfo.tab.
Please send me the Log.out file from the mapping stage.

Cheers
Alex