Understand the percentage_gc_content

Yongchao Ge

Dec 20, 2016, 2:53:57 PM
to biomart-users
Hi,

I'm trying to understand how percentage_gc_content is computed in the R package biomaRt. I was surprised to find that biomaRt reports the same percentage_gc_content for all transcripts of a gene (see the R code below).
For example, the two transcripts (ENSMUST00000187148, ENSMUST00000115891) of gene ENSMUSG00000000103 both have a percentage_gc_content of 36.56.

I also computed the GC percentage manually from the cDNA sequences and obtained 39.46 and 40.20 for ENSMUST00000187148 and ENSMUST00000115891, respectively (see the R code below). I obtained the same result from another source that is independent of the R code below.

1. So my question is: what is the exact meaning of percentage_gc_content in biomaRt?

2. While exploring this, I found that getBM has a bug when the attributes "cdna" or "gene_exon" are requested: the column names are shifted (see the printout of the seq variable in the R code below). It would be nice to have this bug fixed.

The following is the R code

## Understand the GC content % obtained from biomaRt
library(biomaRt)
library(Biostrings)
mart <- useMart(biomart = "ensembl", host = "www.ensembl.org", dataset = "mmusculus_gene_ensembl")

## Manual GC% for one row of the getBM sequence result.
## Because of the shifted column names, the sequence is in column 1
## and the transcript id is in column 2, so both are taken by position.
GCperc <- function(x)
{
    x1 <- DNAString(x[,1])
    alf <- alphabetFrequency(x1, as.prob=TRUE)
    data.frame(ensembl_transcript_id=x[,2], length=length(x1), GCperc=100*sum(alf[c("G", "C")]))
}

## GC% and transcript length as reported by biomaRt
t2g <- getBM(filters="ensembl_gene_id", values="ENSMUSG00000000103",
             attributes=c("ensembl_transcript_id",
                          "percentage_gc_content", "transcript_length"),
             mart=mart)
## cDNA sequence of each transcript
seq <- getBM(filters="ensembl_gene_id", values="ENSMUSG00000000103",
             attributes=c("ensembl_transcript_id", "cdna"), # "gene_exon" is very messy
             mart=mart)
print(t2g)
GCperc(seq[1,])
GCperc(seq[2,])


##and the output
> print(t2g)
  ensembl_transcript_id percentage_gc_content transcript_length
1    ENSMUST00000187148                 36.56              2846
2    ENSMUST00000115891                 36.56              2816
> GCperc(seq[1,])
  ensembl_transcript_id length   GCperc
1    ENSMUST00000115891   2816 40.19886
> GCperc(seq[2,])
  ensembl_transcript_id length   GCperc
1    ENSMUST00000187148   2846 39.45889
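
Because the column labels returned alongside "cdna" cannot be trusted here, one way to make the manual check less fragile is to locate the sequence column by its content instead of by position. The following is only a sketch in base R: gc_percent and find_seq_column are hypothetical helper names, not part of biomaRt or Biostrings, and the nucleotide test assumes that sequences contain only A, C, G, T or N.

```r
# Percentage of G and C among all characters of a nucleotide string
# (base R only; N and other non-GC symbols count toward the length).
gc_percent <- function(s) {
    bases <- strsplit(toupper(s), "")[[1]]
    100 * sum(bases %in% c("G", "C")) / length(bases)
}

# Index of the first column whose values all look like nucleotide sequences.
# Assumption: identifiers such as ENSMUST... never consist only of A/C/G/T/N.
find_seq_column <- function(df) {
    is_seq <- vapply(df, function(col) {
        all(grepl("^[ACGTN]+$", toupper(as.character(col))))
    }, logical(1))
    unname(which(is_seq))[1]
}

# Toy data frame imitating the shifted layout: the sequence sits in the
# column labelled as the transcript id, and the id sits in the "cdna" column.
toy <- data.frame(ensembl_transcript_id = c("GGCCAT", "ATATGC"),
                  cdna = c("tx_A", "tx_B"),
                  stringsAsFactors = FALSE)
seq_col <- find_seq_column(toy)
gc <- vapply(toy[[seq_col]], gc_percent, numeric(1), USE.NAMES = FALSE)
print(data.frame(id = toy[[3 - seq_col]], GCperc = gc))
```

With a real getBM result, the same two helpers could replace the positional x[,1] and x[,2] in GCperc, so the check keeps working whether or not the column names are shifted.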

Thomas Maurel

Dec 21, 2016, 5:44:30 AM
to Yongchao Ge, biomart-users
Hello,

1. In the Ensembl gene mart, percentage_gc_content actually corresponds to the gene %GC content. This is why all the transcripts of gene ENSMUSG00000000103 return the value 36.56.
We can rename this attribute on the interface to “Gene % GC content” and the biomaRt attribute to “percentage_gene_gc_content” for our next release, e!88, if that makes things clearer.
2. The Bioconductor team maintains the biomaRt R package. Could you please report this on the Bioconductor forum: https://support.bioconductor.org/
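
Since the attribute is a per-gene value, it can be requested together with ensembl_gene_id so that each gene yields a single row. The sketch below is a hedged illustration, not the Ensembl or Bioconductor implementation: pick_gc_attribute and fetch_gene_gc are hypothetical helpers, and "percentage_gene_gc_content" is only the rename proposed above, so the code checks which name the mart actually lists before querying.

```r
# Choose whichever GC attribute name the mart actually offers.
# "percentage_gene_gc_content" is the rename proposed in this thread;
# "percentage_gc_content" is the name used in the original question.
pick_gc_attribute <- function(available) {
    candidates <- c("percentage_gene_gc_content", "percentage_gc_content")
    hit <- candidates[candidates %in% available]
    if (length(hit) == 0) stop("no GC content attribute found in this mart")
    hit[1]
}

# Query the gene-level GC value (needs the biomaRt package and network access).
fetch_gene_gc <- function(gene_id, mart) {
    gc_attr <- pick_gc_attribute(biomaRt::listAttributes(mart)$name)
    biomaRt::getBM(filters = "ensembl_gene_id", values = gene_id,
                   attributes = c("ensembl_gene_id", gc_attr),
                   mart = mart)
}

# Example call, left commented out because it contacts the Ensembl server:
# mart <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# fetch_gene_gc("ENSMUSG00000000103", mart)
```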

Hope this helps,
Kind Regards,
Thomas
--
You received this message because you are subscribed to the Google Groups "biomart-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biomart-user...@googlegroups.com.
Visit this group at https://groups.google.com/group/biomart-users.
For more options, visit https://groups.google.com/d/optout.

--
Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Yongchao Ge

Dec 21, 2016, 9:15:07 AM
to Thomas Maurel, biomart-users
Thanks, Thomas, for the reply.

Can you give more details on how "Gene %GC content" is computed?

For example, how do you define the sequence of a gene, since a gene is a collection of many transcripts (isoforms) and different isoforms have different nucleotide sequences?

From an initial guess and a manual check on the example gene ENSMUSG00000000103, you are probably counting all of the nucleotide bases between the gene start position and the gene end position, including positions that fall in introns of every transcript. If that is the case, I'm wondering in which applications the "Gene %GC content" is useful.
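
The guess above can be illustrated with a toy example that compares GC content over the whole genomic span (exons plus introns) with GC content over the spliced exons only. This is a minimal sketch in base R with an invented sequence and invented exon coordinates; it only shows why the two numbers can differ and makes no claim about how Ensembl actually computes the value.

```r
# Percentage of G and C among all characters of a nucleotide string
gc_percent <- function(s) {
    bases <- strsplit(toupper(s), "")[[1]]
    100 * sum(bases %in% c("G", "C")) / length(bases)
}

# Invented 30-base "gene": two GC-rich exons around an AT-rich intron
genomic <- "GCGCGCGCGCATATATATATGCGCGCGCGC"
exons <- data.frame(start = c(1, 21), end = c(10, 30))

# Spliced transcript = exon sequences concatenated in order
spliced <- paste0(substring(genomic, exons$start, exons$end), collapse = "")

gene_span_gc <- gc_percent(genomic)    # includes the intron
transcript_gc <- gc_percent(spliced)   # exons only
cat("gene span GC%:", gene_span_gc, " transcript GC%:", transcript_gc, "\n")
```

In this toy case the AT-rich intron pulls the gene-span value below the transcript value, which is the same direction as the 36.56 versus roughly 40 seen for ENSMUSG00000000103.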

Yongchao




Thomas Maurel

Dec 21, 2016, 9:31:17 AM
to Yongchao Ge, biomart-users
Dear Yongchao,

I am afraid that is beyond my knowledge. Could you please email the Ensembl Helpdesk (http://www.ensembl.org/Help/Contact)? Someone there should be able to tell you how we generate the “Gene %GC content” in Ensembl.

Hope this helps,
Kind Regards,
Thomas

Ivan Molineris

Jun 20, 2018, 9:42:39 AM
to biomart-users
Hi Yongchao,
If you eventually found the answer, could you post the explanation in this thread?
That way, other people searching for it (like me) can find it.

Thanks
