Dear SirPianoPlaya,
Thank you for using the UCSC Genome Browser and your question about obtaining output for GC content for genes.
Here are some steps to make a similar kind of output with some scripts using our Table Browser. One immediate issue, you have accessions for genes, however, each gene like TSPAN6/ENSG00000000003.9 will have multiple transcripts for example (ENST00000496771.1), (ENST00000373020.4), (ENST00000431386.2) and (ENST00000494424.1). The following will be in regards to getting output for each transcript, where the steps will be to generate a custom track for each transcript specific to the examples you provided (V7 table), then to use a tool to extract GC percent and some command line scripts to do the summaries you are looking for.
1. Go to the Table Browser,
http://genome.ucsc.edu/cgi-bin/hgTables, and select the hg19 assembly and pick "All Tables" and select the wgEncodeGencodeAttrsV7 table. (This is to match your identifiers, you have ENSG00000000003.9, where as in later versions it is ENSG00000000003.10).
2. Click the "paste list" button, and enter your transcript identifiers:
ENSG00000000003.9
ENSG00000000005.5
ENSG00000000419.7
ENSG00000000457.8
3. Set output to selected fields from primary and related tables and click get output. Set output file to "infoV7gencodeGenes".
4. Under Linked Tables click the box next to wgEncodeGencodeCompV7 and click allow selection from checked tables.
5. Click geneId geneName transcriptId from the hg19.wgEncodeGencodeAttrsV7 table and name chrom txStart txEnd and name2 from hg19.wgEncodeGencodeCompV7 and then get output. You'll get results like the following that will download to the infoV7gencodeGenes file you named:
ENSG00000000003.9 TSPAN6 ENST00000494424.1 ENST00000494424.1 chrX 99888438 99894988 TSPAN6
ENSG00000000005.5 TNMD ENST00000373031.4 ENST00000373031.4 chrX 99839798 99854882 TNMD
ENSG00000000005.5 TNMD ENST00000485971.1 ENST00000485971.1 chrX 99848620 99852528 TNMD
ENSG00000000419.7 DPM1 ENST00000371588.5 ENST00000371588.5 chr20 49551403 49575087 DPM1
6.Go to the downloaded file on the command line. Remove the first lines so you only have the output. Then split it up into a bed file:
awk '{print $5 " " $6 " " $7 " " $2"."$1"."$3}' infoV7gencodeGenes > bedFileTranscripts
chrX 99888438 99894988 TSPAN6.ENSG00000000003.9.ENST00000494424.1
chrX 99839798 99854882 TNMD.ENSG00000000005.5.ENST00000373031.4
chrX 99848620 99852528 TNMD.ENSG00000000005.5.ENST00000485971.1
chr20 49551403 49575087 DPM1.ENSG00000000419.7.ENST00000371588.5
You could load this file on the browser as a custom track now.
Step2: Use the downloadable binary hgGcPercent and 2bit file to extract the GC information, then use some commands to massage output.
2. Get the 2bit file:
3. Run the program on the bed file of interest:
./hgGcPercent -bedRegionIn=bedFileTranscripts -bedRegionOut=bedFileTranscripts.hgGcPercents -noLoad hg19.2bit
chrX 99848620 99852528 1545 gcCount
chr20 49551403 49575087 9438 gcCount
chr20 49551403 49575087 9438 gcCount
4. Strip out the percent:
awk '{ tmp=($4)/($3-$2); printf"%.04f\n", tmp}' bedFileTranscripts.hgGcPercents > justPercents
0.4278
0.4080
0.3953
0.3985
5. Add that to the earlier bed file
paste bedFileTranscripts justPercents > bedFileTranscripts.added
6. Create a version of what you were desiring:
awk '{ tmp2=($3-$2); printf"\n" $4 " " $5 " " tmp2 }' bedFileTranscripts.added
TSPAN6.ENSG00000000003.9.ENST00000494424.1 0.4278 6550
TNMD.ENSG00000000005.5.ENST00000373031.4 0.4080 15084
TNMD.ENSG00000000005.5.ENST00000485971.1 0.3953 3908
DPM1.ENSG00000000419.7.ENST00000371588.5 0.3985 23684
Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to
gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to
genom...@soe.ucsc.edu.
All the best,
Brian Lee
UCSC Genome Bioinformatics Group