Thanks. I got the CDS length, but I have two questions.
(1) For some coding sequences there is no UTR annotation at all. Does that mean the UTRs are not known? Other coding sequences have an annotation for only the 3' UTR or only the 5' UTR. Does that mean the UTR on the other side is not known?
(2) For some coding sequences, why is there such a large difference between the transcript length (excluding the UTRs) and the CDS length?
Take one hit for the gene "HIST2H4A" as an example (see below): 3' UTR length = 149804678 - 149804561 = 117 and 5' UTR length = 149804248 - 149804221 = 27.
The transcript length is 955, so 955 - 117 - 27 = 811. Why is the length of the coding sequence only 312?
Examples:
GeneName CDS_Length 3' UTR _End 3' UTR_Start 5' UTR_End 5' UTR_Start Transcript_length Ensembl_Transcript_ID
CCDC163P 357 45965751 45965282 2229 ENST00000415578 1
CCDC163P 357 2229 ENST00000415578
HIST2H4A 312 149804678 149804561 149804248 149804221 955 ENST00000369165
    > GRCh37 = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
    > transcript_info = getBM(attributes=c("ensembl_transcript_id","external_gene_name","transcript_start","transcript_end","transcript_length","5_utr_start","5_utr_end","3_utr_start","3_utr_end","cds_length","ensembl_exon_id"), filters=c('ensembl_transcript_id'), values="ENST00000369165", mart=GRCh37)
    > transcript_info
      ensembl_transcript_id external_gene_name transcript_start transcript_end transcript_length 5_utr_start
    1       ENST00000369165           HIST2H4A        149804221      149806197               955   149804221
    2       ENST00000369165           HIST2H4A        149804221      149806197               955          NA
      5_utr_end 3_utr_start 3_utr_end cds_length ensembl_exon_id
    1 149804248   149804561 149804678        312 ENSE00002688671
    2        NA   149805701 149806197        312 ENSE00002715154

| Name | Transcript ID | bp | Protein |
|---|---|---|---|
| HIST2H4A-004 | ENST00000610125 | 716 | 103aa |
| HIST2H4A-003 | ENST00000392939 | 1696 | 103aa |
| HIST2H4A-001 | ENST00000369165 | 955 | 103aa |
| HIST2H4A-002 | ENST00000392938 | 610 | 103aa |

    > transcript_info2 = getBM(attributes=c("ensembl_transcript_id","external_gene_name","transcript_length","ensembl_exon_id","exon_chrom_start","exon_chrom_end","rank"), filters=c('ensembl_transcript_id'), values="ENST00000369165", mart=GRCh37)
    > transcript_info2
      ensembl_transcript_id external_gene_name transcript_length ensembl_exon_id exon_chrom_start exon_chrom_end
    1       ENST00000369165           HIST2H4A               955 ENSE00002688671        149804221      149804678
    2       ENST00000369165           HIST2H4A               955 ENSE00002715154        149805701      149806197
      rank
    1    1
    2    2
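
One way to check the arithmetic for ENST00000369165 offline, as a rough sketch. It assumes that each row of the `getBM` output describes the UTR segment inside one exon, and that start and end coordinates are both inclusive, so a segment covers `end - start + 1` bases. The data frames and the `seg_len` helper are made up for illustration; the coordinates are copied from the two outputs above.

```r
# Exon coordinates from the exon query (GRCh37, ENST00000369165).
exons <- data.frame(
  rank  = c(1, 2),
  start = c(149804221, 149805701),
  end   = c(149804678, 149806197)
)

# UTR segments from the UTR query: one 5' UTR segment (row 1),
# and two 3' UTR segments (rows 1 and 2), one per exon.
utr5 <- data.frame(start = 149804221, end = 149804248)
utr3 <- data.frame(
  start = c(149804561, 149805701),
  end   = c(149804678, 149806197)
)

# Assumption: coordinates are inclusive on both ends,
# so each segment spans end - start + 1 bases.
seg_len <- function(df) sum(df$end - df$start + 1)

transcript_len <- seg_len(exons)  # sum of exon lengths
utr5_len <- seg_len(utr5)
utr3_len <- seg_len(utr3)
cds_len  <- transcript_len - utr5_len - utr3_len

print(c(transcript = transcript_len, utr5 = utr5_len,
        utr3 = utr3_len, cds = cds_len))
```

Under those assumptions the two exons add up to 955 bases, which matches the reported `transcript_length`, and subtracting the 5' UTR (28) and both 3' UTR segments (118 + 497 = 615) leaves 312, which matches the reported `cds_length`. The gap in the original calculation would then come from counting only the first 3' UTR row and from treating the coordinates as exclusive.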