transcript length or coding length


Quanwei Zhang

Jul 16, 2015, 11:56:39 AM
to biomar...@googlegroups.com
I got the transcript lengths for human protein-coding genes through BioMart.
The transcript length is the total length of the mature RNA, right? And are the UTR regions included in the transcript length?
Is there a way to get the length of the coding region (i.e., the translated region)?
Thanks

William Spooner

Jul 17, 2015, 4:32:01 AM
to Quanwei Zhang, biomar...@googlegroups.com
Hi,

There is a 'CDS length' attribute under the 'Structures' attribute section.

Best,

Will



--
William Spooner
Chief Science Officer
M: +44 (0)7779663045 | T: @wspoonr | L: linkedin
Eagle Genomics Ltd

Quanwei Zhang

Jul 17, 2015, 10:37:15 AM
to biomar...@googlegroups.com, qwzha...@gmail.com


Thanks. I got the CDS length, but I have two questions.

(1) I found that some coding sequences have no annotation for UTRs. Does that mean the UTRs are not known? And some coding sequences have annotation for only the 3' UTR or only the 5' UTR. Does that mean the UTR on the other side is not known?

(2) For some coding sequences, why is there such a large difference between the transcript length (excluding the UTRs) and the coding length?

Take one hit for the gene "HIST2H4A" as an example (see below): 3UTR_length=149804678-149804561=117, 5UTR_length=149804248-149804221=27.

The transcript length is 955, so 955-117-27=811. Why is the length of the coding sequence only 312?


Examples:

GeneName     CDS_Length 3' UTR _End 3' UTR_Start 5' UTR_End 5' UTR_Start Transcript_length Ensembl_Transcript_ID

CCDC163P       357                                   45965751 45965282      2229 ENST00000415578 1

CCDC163P       357                                                     2229 ENST00000415578

HIST2H4A   312   149804678  149804561  149804248  149804221  955   ENST00000369165

William Spooner

Jul 17, 2015, 1:01:50 PM
to Quanwei Zhang, biomar...@googlegroups.com
On Fri, Jul 17, 2015 at 3:37 PM, Quanwei Zhang <qwzha...@gmail.com> wrote:
>
>
> Thanks. I got the CDS length, but I have two questions.
>
> (1)I found for some coding sequence, there is no annotation for UTRs, does
> it mean the UTRs are not known? And some coding sequence only have
> annotation for either 3'UTR or 5'UTR. Does it mean the other side UTR is not
> known?

Correct



>
> (2)For some coding sequence I wonder why there is a huge difference between
> transcript length (exclude the UTRs) and coding length?
>
> Take one hit for gene "HIST2H4A" as an example (see below) the
> 3UTR_length=149804678-149804561=117, 5UTR_length=149804248-149804221=27.
>
> the transcript length is 955, so 955-117-27=811. Why the length of coding
> sequence is only 312?
>
>
> Examples:
>
> GeneName CDS_Length 3' UTR _End 3' UTR_Start 5' UTR_End 5' UTR_Start
> Transcript_length Ensembl_Transcript_ID
>
> CCDC163P 357 45965751 45965282
> 2229 ENST00000415578 1
>
> CCDC163P 357 2229
> ENST00000415578
>
> HIST2H4A 312 149804678 149804561 149804248 14980422 1 955
> ENST00000369165

Do the 5' UTR_Start and 3' UTR_End coordinates correspond to the
transcript start/end coordinates? It does not look like it to me. That
suggests there is an error in the UTR coordinates in the database.

http://feb2014.archive.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000183941;r=1:149804221-149806197;t=ENST00000369165

Best,

Will

Thomas Maurel

Jul 20, 2015, 6:35:00 AM
to William Spooner, Quanwei Zhang, biomar...@googlegroups.com
Hello, 

Yes, the 5’ UTR_Start and 3’ UTR_End coordinates correspond to the transcript start/end coordinates, as you can see in the output below (5_utr_start in row 1 and 3_utr_end in row 2). The line above only displayed the values for exon 1 of the transcript: ENSE00002688671.

> GRCh37 = useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl",GRCh=37)
> transcript_info=getBM(attributes=c("ensembl_transcript_id","external_gene_name","transcript_start","transcript_end", "transcript_length","5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end","cds_length","ensembl_exon_id"),filters=c('ensembl_transcript_id'),values="ENST00000369165",mart=GRCh37)
> transcript_info
  ensembl_transcript_id external_gene_name transcript_start transcript_end transcript_length 5_utr_start
1       ENST00000369165           HIST2H4A        149804221      149806197               955   149804221
2       ENST00000369165           HIST2H4A        149804221      149806197               955          NA
  5_utr_end 3_utr_start 3_utr_end cds_length ensembl_exon_id
1 149804248   149804561 149804678        312 ENSE00002688671
2        NA   149805701 149806197        312 ENSE00002715154

I believe the confusion comes from the mart “Transcript length” attribute, which displays the full transcript length including UTRs and CDS (as shown in the Transcript table on the Ensembl website):
Name          Transcript ID     bp    Protein
HIST2H4A-004  ENST00000610125   716   103aa
HIST2H4A-003  ENST00000392939   1696  103aa
HIST2H4A-001  ENST00000369165   955   103aa
HIST2H4A-002  ENST00000392938   610   103aa

The transcript length in Ensembl is the sum of the exon lengths:

> transcript_info2=getBM(attributes=c("ensembl_transcript_id","external_gene_name","transcript_length","ensembl_exon_id","exon_chrom_start","exon_chrom_end","rank"),filters=c('ensembl_transcript_id'),values="ENST00000369165",mart=GRCh37)
> transcript_info2
  ensembl_transcript_id external_gene_name transcript_length ensembl_exon_id exon_chrom_start exon_chrom_end
1       ENST00000369165           HIST2H4A               955 ENSE00002688671        149804221      149804678
2       ENST00000369165           HIST2H4A               955 ENSE00002715154        149805701      149806197
  rank
1    1
2    2

Exon 1 length: 149804678-149804221+1=458
Exon 2 length: 149806197-149805701+1= 497

Transcript length= 458+497= 955
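
The same arithmetic can be reproduced in R without querying the mart. This is only a small sketch: the exon coordinates are hard-coded from the output above, the data frame and the exon_length column are names chosen here for illustration, and it assumes the coordinates are 1-based and inclusive (hence the +1).

```r
# Exon coordinates for ENST00000369165, copied from the getBM() output above.
exons <- data.frame(
  ensembl_exon_id  = c("ENSE00002688671", "ENSE00002715154"),
  exon_chrom_start = c(149804221, 149805701),
  exon_chrom_end   = c(149804678, 149806197)
)

# Start and end are both part of the exon, so add 1 to the difference.
exons$exon_length <- exons$exon_chrom_end - exons$exon_chrom_start + 1

# Transcript length is the sum of the exon lengths (introns are not counted).
transcript_length <- sum(exons$exon_length)

print(exons$exon_length)
print(transcript_length)
```

Applied to a full getBM() result with several transcripts, the same sum would need to be taken per ensembl_transcript_id rather than over the whole table.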

The Exon page of the Ensembl website, with the line numbering “Relative to the coding sequence” turned on, confirms that the CDS length is 312 and that the Exon 1 and Exon 2 lengths are 458 and 497: (http://grch37.ensembl.org/Homo_sapiens/Share/148b9c0106c8adafc7543cb8e33aa175194447320).
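
To reconcile the 312 bp CDS with the 955 bp transcript, the UTR bases have to be summed over every exon row, not only the first one. A rough sketch in R, using only the coordinates shown in this thread: the column names (utr5_start and so on) and the helper span() are shorthand made up for this example, not BioMart attribute names, and it assumes 1-based inclusive coordinates with non-overlapping UTR ranges.

```r
# One row per exon of ENST00000369165, values copied from the output above.
# NA means that exon carries no UTR of that kind.
rows <- data.frame(
  exon_chrom_start = c(149804221, 149805701),
  exon_chrom_end   = c(149804678, 149806197),
  utr5_start       = c(149804221, NA),
  utr5_end         = c(149804248, NA),
  utr3_start       = c(149804561, 149805701),
  utr3_end         = c(149804678, 149806197)
)

# Hypothetical helper: inclusive length of a range, 0 where the range is missing.
span <- function(start, end) ifelse(is.na(start), 0, end - start + 1)

transcript_length <- sum(span(rows$exon_chrom_start, rows$exon_chrom_end))
utr5_length <- sum(span(rows$utr5_start, rows$utr5_end))
utr3_length <- sum(span(rows$utr3_start, rows$utr3_end))

# Whatever is exonic but not UTR is coding sequence.
cds_length <- transcript_length - utr5_length - utr3_length

print(c(transcript = transcript_length, utr5 = utr5_length,
        utr3 = utr3_length, cds = cds_length))
```

With these numbers the 5' UTR is 28 bp, the 3' UTR is 118 bp in exon 1 plus all 497 bp of exon 2, and 955 - 28 - 615 leaves 312, which matches the cds_length attribute.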

Hope this helps,
Regards,
Thomas

--
Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
