[Genome] Calculating GC percent for regions in .bed format

545 views
Skip to first unread message

Alex Nord

unread,
Mar 9, 2009, 11:49:55 AM3/9/09
to gen...@soe.ucsc.edu
Hello,

I would like to calculate the average GC percent for a series of human
genome regions found in a .bed file. Downloading the 5bp window GC
data from the table browser is not viable due to the number of bases
in each region (up to 1mb) and the number of regions (>1million). I
have downloaded the raw data in the table hg18.gc5Base.txt, and I am
wondering if there is a utility already designed to perform this
task. I looked at other postings, and it seems like the hggcpercent
tool might work, but I am new to using the genome data and couldn't
find the specific documentation/download directions.

Is the hggcpercent tool a good fit for what I am trying to
accomplish? Can I use it as a stand alone utility? Is there
straightforward documentation available?

Please consider that I am not a strong programmer, though comfortable
with short scripts and shell work and happy to learn.

Thanks much,

Alex

Alex Nord
Department of Genome Sciences
University of Washington





Jennifer Jackson

unread,
Mar 9, 2009, 11:25:00 PM3/9/09
to Alex Nord, gen...@soe.ucsc.edu
Hello,
hgGcPercent would be a good choice although you will need to do some
simple programming to finalize the calculations. The file you mention
(hg18.gc5Base.txt) is one type of output from this script, so it will
not be your starting place. You will start with your BED file of genomic
regions and the genomic sequence itself.

Input: 3 column BED file (strand is ignored).

Output: 5 column BED file where column 4 is the actual count of g/c
bases in the region. This is where you would need to write a quick
script to get the length from columns 2 and 3, then calculate % from there.

Example:
$ hgGcPercent -bedRegionIn=test.in.bed -bedRegionOut=test.in.bed.out
-noLoad hg18 /gbdb/hg18/nib
$ head -2 test.in.bed*
==> test.in.bed <==
chr1 1115 4121 1115 1115 +
chr1 1115 4272 1115 1115 +

==> test.in.bed.out <==
chr1 1115 4121 1680 gcCount
chr1 1115 4272 1764 gcCount

You will need the .2bit file from our Downloads area (by ftp - this is
the /gbdb/hg18/nib in my example).
You will need to download the whole source tree and have MySQL client
libraries installed. You do not have to build the entire tree -- after
completing steps 1 through 4 in kent/src/product/README.building.source,
the minimum you need to do is:

cd hg/makeDb/hgGcPercent
make

Instructions for obtaining UCSC Browser code source:
http://genome.ucsc.edu/FAQ/FAQdownloads#download27
List of our utilities:
http://genomewiki.cse.ucsc.edu/index.php/Kent_source_utilities
Location for downloading data: http://hgdownload.cse.ucsc.edu/downloads.html
MySQL client libraries are free and available from many web sources.

If these seems overly complicated and you do not plan on using our other
utilities, another option is to load your BED file into the Table
Browser (using region: define region) and extract the fasta sequence
from the genomic (using the Mapping and Sequencing Tracks, Assembly,
table = gold) with named, downloaded output. A fairly simple script
could move through each sequence, pop off the bases, and do the
calculations.

I hope this gives you some choices that will work for you,
Jennifer Jackson
UCSC Genome Bioinformatics Group
> _______________________________________________
> Genome maillist - Gen...@soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>
Reply all
Reply to author
Forward
0 new messages