Hi,
I'm trying to understand how the percentage_gc_content is computed from the R package biomaRt. Initially, I found that it was quite strange that the percentage_gc_content is the same for all transcripts of a gene from biomaRt (see the R code below).
For example, for the two transcripts (ENSMUST00000187148,ENSMUST00000115891) of gene ENSMUSG00000000103, the percentage_gc_content for both transcripts is the same (36.56)
I also computed the percentage_gc_content manually, we obtained the 40.20 and 39.46 respectively for the two transcripts ENSMUST00000187148,ENSMUST00000115891 (see the R code below). I also obtained the same result when I used another source that is independent of the R code below.
1 .So my question is, what is the exact meaning of the percentage_gc_content in the BiomaRt?
2. While exploring this, the BM function has a bug if we had the attributes "cdna" or "gene_exon" in the getBM function (see the print out of the seq variable in the following R code) where the column names has been shifted. It would be nice to have this bug fixed.
The following is the R code
##Understand the GC_content % obtained from biomaRt
library(biomaRt)
mart<-useMart(biomart = "ensembl",host="
www.ensembl.org",dataset ="mmusculus_gene_ensembl")
library(Biostrings)
GCperc<-function(x)
{
x1<-DNAString(x[,1])
alf <- alphabetFrequency(x1, as.prob=TRUE)
data.frame(ensembl_transcript_id=x[,2],length=length(x1),GCperc=100*sum(alf[c("G", "C")]))
}
t2g<-getBM(filter="ensembl_gene_id",values="ENSMUSG00000000103",
attributes = c("ensembl_transcript_id",
"percentage_gc_content","transcript_length"),
mart = mart)
seq<-getBM(filter="ensembl_gene_id",values="ENSMUSG00000000103",
attributes=c("ensembl_transcript_id","cdna"),#"gene_exon very messy"
mart=mart)
print(t2g)
GCperc(seq[1,])
GCperc(seq[2,])
##and the output
> print(t2g)
ensembl_transcript_id percentage_gc_content transcript_length
1 ENSMUST00000187148 36.56 2846
2 ENSMUST00000115891 36.56 2816
> GCperc(seq[1,])
ensembl_transcript_id length GCperc
1 ENSMUST00000115891 2816 40.19886
> GCperc(seq[2,])
ensembl_transcript_id length GCperc
1 ENSMUST00000187148 2846 39.45889