transcript length or coding length

Quanwei Zhang

Jul 16, 2015, 11:56:39 AM

to biomar...@googlegroups.com

I retrieved the transcript length for human protein-coding genes through BioMart.
The transcript length is the total length of the mature RNA, right? And does the transcript length include the UTR regions?
Is there a way to get the length of the coding region (i.e., the translated region)?
Thanks

William Spooner

Jul 17, 2015, 4:32:01 AM

to Quanwei Zhang, biomar...@googlegroups.com

Hi,

There is a 'CDS length' attribute under the 'Structures' attribute section.

Best,

Will


--
William Spooner
Chief Science Officer
Eagle Genomics Ltd

Quanwei Zhang

Jul 17, 2015, 10:37:15 AM

to biomar...@googlegroups.com, qwzha...@gmail.com

Thanks. I got the CDS length, but I have two questions.

(1) For some coding sequences there is no UTR annotation at all. Does that mean the UTRs are not known? Other coding sequences have an annotation for only the 3' UTR or only the 5' UTR. Does that mean the UTR on the other side is not known?

(2) For some coding sequences, why is there such a large difference between the transcript length (excluding the UTRs) and the coding length?

Take one hit for gene "HIST2H4A" as an example (see below): 3UTR_length=149804678-149804561=117 and 5UTR_length=149804248-149804221=27.

The transcript length is 955, so 955-117-27=811. Why is the length of the coding sequence only 312?

Examples:

GeneName CDS_Length 3' UTR _End 3' UTR_Start 5' UTR_End 5' UTR_Start Transcript_length Ensembl_Transcript_ID

CCDC163P 357 45965751 45965282 2229 ENST00000415578 1

CCDC163P 357 2229 ENST00000415578

HIST2H4A 312 149804678 149804561 149804248 149804221 955 ENST00000369165

William Spooner

Jul 17, 2015, 1:01:50 PM

to Quanwei Zhang, biomar...@googlegroups.com

On Fri, Jul 17, 2015 at 3:37 PM, Quanwei Zhang <qwzha...@gmail.com> wrote:
>
>
> Thanks. I got the CDS length, but I have two questions.
>
> (1)I found for some coding sequence, there is no annotation for UTRs, does
> it mean the UTRs are not known? And some coding sequence only have
> annotation for either 3'UTR or 5'UTR. Does it mean the other side UTR is not
> known?

Correct

>
> (2)For some coding sequence I wonder why there is a huge difference between
> transcript length (exclude the UTRs) and coding length?
>
> Take one hit for gene "HIST2H4A" as an example (see below) the
> 3UTR_length=149804678-149804561=117, 5UTR_length=149804248-149804221=27.
>
> the transcript length is 955, so 955-117-27=811. Why the length of coding
> sequence is only 312?
>
>
> Examples:
>
> GeneName CDS_Length 3' UTR _End 3' UTR_Start 5' UTR_End 5' UTR_Start
> Transcript_length Ensembl_Transcript_ID
>
> CCDC163P 357 45965751 45965282
> 2229 ENST00000415578 1
>
> CCDC163P 357 2229
> ENST00000415578
>
> HIST2H4A 312 149804678 149804561 149804248 149804221 955
> ENST00000369165

Do the 5' UTR_Start and 3' UTR _End coordinates correspond to the
transcript start/end coordinates? It does not look like it to me, which
suggests there is an error in the UTR coordinates in the database.

http://feb2014.archive.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000183941;r=1:149804221-149806197;t=ENST00000369165

Best,

Will

Thomas Maurel

Jul 20, 2015, 6:35:00 AM

to William Spooner, Quanwei Zhang, biomar...@googlegroups.com

Hello,

Yes, the 5’ UTR_Start and 3’ UTR _End coordinates correspond to the transcript start/end coordinates, as you can see in the output below. The line quoted above only displayed the values for exon 1 of the transcript, ENSE00002688671.

> library(biomaRt)
> GRCh37 = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
> transcript_info = getBM(attributes=c("ensembl_transcript_id", "external_gene_name",
+                                      "transcript_start", "transcript_end", "transcript_length",
+                                      "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
+                                      "cds_length", "ensembl_exon_id"),
+                         filters=c("ensembl_transcript_id"), values="ENST00000369165", mart=GRCh37)
> transcript_info
  ensembl_transcript_id external_gene_name transcript_start transcript_end transcript_length 5_utr_start
1       ENST00000369165           HIST2H4A        149804221      149806197               955   149804221
2       ENST00000369165           HIST2H4A        149804221      149806197               955          NA
  5_utr_end 3_utr_start 3_utr_end cds_length ensembl_exon_id
1 149804248   149804561 149804678        312 ENSE00002688671
2        NA   149805701 149806197        312 ENSE00002715154

I believe the confusion comes from the mart “Transcript length” attribute, which displays the full transcript length including the UTRs and the CDS (as shown in the Transcript table on the Ensembl website):

Name	Transcript ID	bp	Protein
HIST2H4A-004	ENST00000610125	716	103aa
HIST2H4A-003	ENST00000392939	1696	103aa
HIST2H4A-001	ENST00000369165	955	103aa
HIST2H4A-002	ENST00000392938	610	103aa
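
As a rough check of this point, the sketch below splits the 955 bp of ENST00000369165 into 5' UTR, CDS and 3' UTR. The coordinates are copied by hand from the getBM() output above; the data frame, its simplified column names and the helper span_length() are illustrative assumptions, not part of biomaRt. It assumes that the UTR coordinates are inclusive and that an NA means the exon carries no UTR of that kind.

```r
# Rows copied from the getBM() output above (one row per exon of ENST00000369165).
# Column names are simplified for this sketch; NA means no UTR of that kind on the exon.
utr <- data.frame(
  exon_id         = c("ENSE00002688671", "ENSE00002715154"),
  five_utr_start  = c(149804221, NA),
  five_utr_end    = c(149804248, NA),
  three_utr_start = c(149804561, 149805701),
  three_utr_end   = c(149804678, 149806197)
)
transcript_length <- 955
cds_length <- 312

# Hypothetical helper: inclusive span length, counting a missing UTR as 0 bp.
span_length <- function(start, end) {
  ifelse(is.na(start) | is.na(end), 0, end - start + 1)
}

five_utr_total  <- sum(span_length(utr$five_utr_start, utr$five_utr_end))    # 28
three_utr_total <- sum(span_length(utr$three_utr_start, utr$three_utr_end))  # 118 + 497 = 615

# 28 + 312 + 615 = 955, so the three parts account for the whole transcript.
five_utr_total + cds_length + three_utr_total == transcript_length
```

Under these assumptions the 3' UTR spans the end of exon 1 and all of exon 2, which is why subtracting only the UTR coordinates of the first row leaves far more than 312 bp.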

The transcript length in Ensembl is the sum of the exon lengths:

> transcript_info2 = getBM(attributes=c("ensembl_transcript_id", "external_gene_name",
+                                       "transcript_length", "ensembl_exon_id",
+                                       "exon_chrom_start", "exon_chrom_end", "rank"),
+                          filters=c("ensembl_transcript_id"), values="ENST00000369165", mart=GRCh37)
> transcript_info2
  ensembl_transcript_id external_gene_name transcript_length ensembl_exon_id exon_chrom_start exon_chrom_end
1       ENST00000369165           HIST2H4A               955 ENSE00002688671        149804221      149804678
2       ENST00000369165           HIST2H4A               955 ENSE00002715154        149805701      149806197
  rank
1    1
2    2

Exon 1 length: 149804678-149804221+1 = 458

Exon 2 length: 149806197-149805701+1 = 497

Transcript length = 458+497 = 955

The Exon page of the Ensembl website, with the line numbering “Relative to the coding sequence” turned on, confirms that the CDS length is 312 and that the lengths of exon 1 and exon 2 are 458 and 497 (http://grch37.ensembl.org/Homo_sapiens/Share/148b9c0106c8adafc7543cb8e33aa175194447320).
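
The exon arithmetic can also be reproduced in R without querying the mart. This is only a sketch: the data frame is typed in by hand from the exon coordinates listed earlier in this message, and the exon_length column is added here for illustration. It assumes inclusive coordinates, hence the +1.

```r
# Exon coordinates for ENST00000369165, copied by hand from the exon listing above.
exons <- data.frame(
  ensembl_transcript_id = "ENST00000369165",
  ensembl_exon_id       = c("ENSE00002688671", "ENSE00002715154"),
  exon_chrom_start      = c(149804221, 149805701),
  exon_chrom_end        = c(149804678, 149806197)
)

# Coordinates are inclusive, hence the +1.
exons$exon_length <- exons$exon_chrom_end - exons$exon_chrom_start + 1

# Sum the exon lengths per transcript; the same call works for several transcripts at once.
summed <- aggregate(exon_length ~ ensembl_transcript_id, data = exons, FUN = sum)
summed
```

The summed value for this transcript is 955, the same as the "Transcript length" attribute. Applying the same aggregate() call to a getBM() result with many transcripts should give one row per transcript, assuming each exon appears only once per transcript in the result.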

Hope this helps,

Regards,

Thomas

--

Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
