help about conservation score computation


WZ

Jan 17, 2014, 3:11:35 PM
to gen...@soe.ucsc.edu
Dear Sir/Madam,

I am a postdoctoral fellow at Albert Einstein College of Medicine. I would like to calculate conservation scores for a set of enhancers. I have a list of enhancer regions on each chromosome;
for each region I have the chromosome, start position, and end position. I want to get the average conservation score of each enhancer, which probably means getting the score at every position within the region and averaging them. Could you please
tell me how to compute the conservation score at each position using the UCSC Genome Browser? The number of regions is very large, about 400,000. Can I accomplish this using the browser?

Thank you very much!

Wen

Jonathan Casper

Jan 20, 2014, 2:05:20 PM
to WZ, gen...@soe.ucsc.edu

Hello Wen,

Thank you for your question about obtaining average conservation scores. You can get the raw conservation data from the UCSC Table Browser, but you would have to average the scores on your own. We suggest instead that you use our command-line utilities or the tools at Galaxy (http://usegalaxy.org), both of which can do the score averaging for you.

If you are able to run our command-line utilities, we have a tool that may help you significantly: bigWigAverageOverBed. This tool takes as input a file in bigWig format (like the conservation data) and a list of genomic regions in BED format. The output file contains several columns of information about each region from the BED file, including the average value of the bigWig data in that region.
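As a rough sketch of how the output might be used: to my knowledge the tool is invoked as `bigWigAverageOverBed in.bw in.bed out.tab` and writes six tab-separated columns (name, size, covered, sum, mean0, mean), where mean0 counts bases without data as zero and mean averages only the bases that have data. The Python below parses such output. The function name and the sample line are made up for illustration, and the column layout is an assumption that should be checked against the usage message printed by your copy of the program.

```python
def parse_average_over_bed(lines):
    """Map region name -> stats, assuming six tab-separated columns:
    name, size, covered, sum, mean0, mean (assumed layout, verify locally)."""
    results = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, size, covered, total, mean0, mean = line.split("\t")
        results[name] = {
            "size": int(size),        # region length in bases
            "covered": int(covered),  # bases that have a score
            "sum": float(total),      # sum of scores over covered bases
            "mean0": float(mean0),    # average with uncovered bases as 0
            "mean": float(mean),      # average over covered bases only
        }
    return results


# Hypothetical output line, not real conservation data.
sample = ["enhancer_1\t100\t80\t40.0\t0.4\t0.5\n"]
stats = parse_average_over_bed(sample)
print(stats["enhancer_1"]["mean"])
```

For a real run, the same function could be given an open file handle on the output file, for example `parse_average_over_bed(open("out.tab"))`, where `out.tab` is whatever output name was passed to the tool.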

Information about downloading and running the bigWigAverageOverBed program can be found at http://hgdownload.soe.ucsc.edu/admin/exe/. We provide precompiled versions of the program for several computer architectures and source code to compile it yourself if needed.

bigWig files containing conservation data for the hg19 100-way track are available from our download server at http://hgdownload.soe.ucsc.edu. Follow the page down to the Human/hg19 section, and then look for the "Multiple Alignments" heading. The "Conservation Scores for alignments of 99 vertebrate genomes with Human" link leads to the files containing phastCons conservation scores, while "Basewise conservation scores (phyloP) of 99 vertebrate genomes with Human" leads to phyloP data. More information about the difference between these conservation measures is available on the 100-way track description page at http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=cons100way.

You can create your own BED file of enhancer regions by creating a simple text file and listing each region on one line in the following format:

chromosome   start_position   end_position   name

Example:
chr1   22456789   22478309   my_region

The exact format for each of these fields is described on the BED format page at http://genome.ucsc.edu/FAQ/FAQformat.html#format1. Please make sure that you save this as a simple text file - it will not be recognized if saved as a Microsoft Word document or other, similar formats.
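With about 400,000 regions it is probably easier to generate the BED file with a script than by hand. The following is a minimal Python sketch, assuming the regions are already available as (chromosome, start, end) tuples with 0-based starts and exclusive ends, which is the BED convention. The function name and the `enhancer_N` naming scheme are illustrative choices; each region gets a distinct name because bigWigAverageOverBed reports its results by name.

```python
def regions_to_bed(regions, prefix="enhancer"):
    """Return tab-separated BED text with a unique name for each region."""
    lines = []
    for i, (chrom, start, end) in enumerate(regions, start=1):
        if not (0 <= start < end):
            raise ValueError(f"bad interval: {chrom} {start} {end}")
        # Four columns: chromosome, start, end, name.
        lines.append(f"{chrom}\t{start}\t{end}\t{prefix}_{i}")
    return "\n".join(lines) + "\n"


# The first region is the example from this message; the second is made up.
bed_text = regions_to_bed([
    ("chr1", 22456789, 22478309),
    ("chr2", 1000, 2000),
])
print(bed_text, end="")
```

Writing `bed_text` to a plain file, for example with `open("enhancers.bed", "w").write(bed_text)`, produces a simple text file in the expected format; `enhancers.bed` is only a placeholder name.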

If you are unable to run our bigWigAverageOverBed program, we recommend that you use the tools at Galaxy (http://usegalaxy.org). You can obtain conservation scores in wiggle format from the same location as the bigWig data on our download server at http://hgdownload.soe.ucsc.edu (look for the wigFix.gz files). Upload the wiggle conservation data file and the BED file of your enhancer regions to Galaxy with the "Get Data" >> "Upload File" menu item. Next, open the "Get Genomic Scores" >> "Aggregate datapoints" tool. For "Interval file" select your BED region data, and for "Score Source" select the wiggle data that you uploaded. The tool should create output that shows the average score for each BED region.
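To make clear what "average score for each BED region" means for fixedStep wiggle data, here is a small Python sketch of the computation. It is not how the Galaxy tool is implemented, and it is far too memory-hungry for whole-genome files; it only illustrates the coordinate handling. It assumes fixedStep declaration lines of the form `fixedStep chrom=... start=... step=...` with a span of 1, that wiggle start positions are 1-based while BED starts are 0-based with exclusive ends, and it averages only positions that have a score. The function names and sample values are hypothetical.

```python
def load_fixed_step(lines):
    """Return {chrom: {1-based position: score}} from fixedStep wiggle lines.
    Assumes span=1 (one score per declared position)."""
    scores = {}
    chrom, pos, step = None, 0, 1
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Declaration line: read the key=value fields.
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
            continue
        scores.setdefault(chrom, {})[pos] = float(line)
        pos += step
    return scores


def region_mean(scores, chrom, start, end):
    """Mean score over a BED interval (0-based start, exclusive end),
    using only positions that have data; None if nothing is covered."""
    chrom_scores = scores.get(chrom, {})
    # BED [start, end) corresponds to 1-based positions start+1 .. end.
    values = [chrom_scores[p] for p in range(start + 1, end + 1)
              if p in chrom_scores]
    return sum(values) / len(values) if values else None


# Made-up scores for 1-based positions 11 to 14 on chr1.
wig = ["fixedStep chrom=chr1 start=11 step=1", "0.1", "0.3", "0.5", "0.7"]
scores = load_fixed_step(wig)
mean = region_mean(scores, "chr1", 10, 12)  # covers positions 11 and 12
print(mean)
```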

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group



