GC content and gene length

405 views
Skip to first unread message

SirPianoPlaya .

unread,
Jun 24, 2014, 2:39:06 PM6/24/14
to gen...@soe.ucsc.edu
I want to get the following output for the whole human genome:

Gene GC Length
ENSG00000000003.9.TSPAN6 40.87 11321
ENSG00000000005.5.TNMD 40.8 15083
ENSG00000000419.7.DPM1 39.85 23688
ENSG00000000457.8.SCYL3 40.34 41604
...


I was wondering if the UCSC table browser can do this, or if there is any other websites with this information. I greatly appreciate your help!

Brian Lee

unread,
Jun 26, 2014, 6:17:07 PM6/26/14
to SirPianoPlaya ., gen...@soe.ucsc.edu
Dear SirPianoPlaya,

Thank you for using the UCSC Genome Browser and your question about obtaining output for GC content for genes.

There is not a tool available at the UCSC Browser to exactly provide the output you are hoping for, you may want to look at Bioinformatic forums to see if something exists. Also, for future questions please search our mailing list archives for possible solutions, for example, this archived answer would be of interest: https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!searchin/genome/$20calculate$20gc$20percent$20/genome/UIWvbn9Zplc/JJDIem1lEIsJ

Here are some steps to make a similar kind of output with some scripts using our Table Browser. One immediate issue, you have accessions for genes, however, each gene like TSPAN6/ENSG00000000003.9 will have multiple transcripts for example (ENST00000496771.1), (ENST00000373020.4), (ENST00000431386.2) and (ENST00000494424.1). The following will be in regards to getting output for each transcript, where the steps will be to generate a custom track for each transcript specific to the examples you provided (V7 table), then to use a tool to extract GC percent and some command line scripts to do the summaries you are looking for.

1. Go to the Table Browser, http://genome.ucsc.edu/cgi-bin/hgTables, and select the hg19 assembly and pick "All Tables" and select the wgEncodeGencodeAttrsV7 table. (This is to match your identifiers, you have ENSG00000000003.9, where as in later versions it is ENSG00000000003.10).

2. Click the "paste list" button, and enter your transcript identifiers: 
ENSG00000000003.9 
ENSG00000000005.5
ENSG00000000419.7
ENSG00000000457.8

3. Set output to selected fields from primary and related tables and click get output. Set output file to "infoV7gencodeGenes".

4. Under Linked Tables click the box next to wgEncodeGencodeCompV7 and click allow selection from checked tables.

5. Click geneId geneName transcriptId from the hg19.wgEncodeGencodeAttrsV7 table and name chrom txStart txEnd and name2 from hg19.wgEncodeGencodeCompV7 and then get output. You'll get results like the following that will download to the infoV7gencodeGenes file you named:

ENSG00000000003.9 TSPAN6 ENST00000494424.1 ENST00000494424.1 chrX 99888438 99894988 TSPAN6
ENSG00000000005.5 TNMD ENST00000373031.4 ENST00000373031.4 chrX 99839798 99854882 TNMD
ENSG00000000005.5 TNMD ENST00000485971.1 ENST00000485971.1 chrX 99848620 99852528 TNMD
ENSG00000000419.7 DPM1 ENST00000371588.5 ENST00000371588.5 chr20 49551403 49575087 DPM1

6.Go to the downloaded file on the command line. Remove the first lines so you only have the output. Then split it up into a bed file: 
awk '{print $5 " " $6 " " $7 " " $2"."$1"."$3}' infoV7gencodeGenes > bedFileTranscripts
chrX 99888438 99894988 TSPAN6.ENSG00000000003.9.ENST00000494424.1
chrX 99839798 99854882 TNMD.ENSG00000000005.5.ENST00000373031.4
chrX 99848620 99852528 TNMD.ENSG00000000005.5.ENST00000485971.1
chr20 49551403 49575087 DPM1.ENSG00000000419.7.ENST00000371588.5

You could load this file on the browser as a custom track now.

Step2: Use the downloadable binary hgGcPercent and 2bit file to extract the GC information, then use some commands to massage output.

1. Get the appropriate version of hgGcPercent, use uname -a, then chmod +x once downloaded: http://hgdownload.soe.ucsc.edu/admin/exe/

2. Get the 2bit file:

3. Run the program on the bed file of interest: 
./hgGcPercent -bedRegionIn=bedFileTranscripts -bedRegionOut=bedFileTranscripts.hgGcPercents -noLoad hg19.2bit

chrX 99848620 99852528 1545 gcCount
chr20 49551403 49575087 9438 gcCount
chr20 49551403 49575087 9438 gcCount

4. Strip out the percent:
awk '{ tmp=($4)/($3-$2); printf"%.04f\n", tmp}' bedFileTranscripts.hgGcPercents > justPercents

0.4278
0.4080
0.3953
0.3985

5. Add that to the earlier bed file
paste bedFileTranscripts justPercents > bedFileTranscripts.added

6. Create a version of what you were desiring:
awk '{ tmp2=($3-$2); printf"\n" $4 " " $5 " " tmp2 }' bedFileTranscripts.added

TSPAN6.ENSG00000000003.9.ENST00000494424.1 0.4278 6550
TNMD.ENSG00000000005.5.ENST00000373031.4 0.4080 15084
TNMD.ENSG00000000005.5.ENST00000485971.1 0.3953 3908
DPM1.ENSG00000000419.7.ENST00000371588.5 0.3985 23684

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group


--


Reply all
Reply to author
Forward
0 new messages