Batch query of conservation scores

Sergei Manakov

Jun 5, 2015, 5:52:52 PM

to gen...@soe.ucsc.edu

Hello,

Sorry if this was asked before, but what is a good way to get per-nucleotide conservation scores for a region? We have a few thousand regions of 10 to 20 bp, so we are looking to make this query as a batch.

thanks,
Sergei

Sergei (Siarhei Manakou) Manakov

California Institute of Technology
MC 147-75

land: +1 626 395 3593

mobile: + 1 858 729 4531

Matthew Speir

Jun 8, 2015, 3:55:25 PM

to Sergei Manakov, gen...@soe.ucsc.edu

Hi Sergei,

Thank you for your question about obtaining base-wise conservation
scores for a number of regions. There are a few different ways you can
do this using the tools we provide at the UCSC Genome Browser. The first
method is using the Table Browser to extract the scores from a list of
regions. The only disadvantage to this method is that you can only
submit 1000 regions at a time, so if you have more than 1000 regions you
will have to re-run the query multiple times with different sets of
regions. To use the Table Browser to get this information, use the
following steps:

1. Navigate to the Table Browser.
2. Make the following selections:
   clade: Mammal
   genome: Human
   assembly: Feb. 2009 (GRCh37/hg19)
   group: Comparative Genomics
   track: Conservation
   table: 100 Vert. Cons (phyloP100wayAll)
   output: data points
   output file: enter a file name to save your results to a file, or
   leave blank to display results in your browser

(Note: I've used hg19 and the 100-way alignment in this example, but you
can choose any species that has an n-way alignment and a phyloP track.)

3. Click "define regions".
4. Input up to 1000 regions in BED 3 or 4 format.
5. Click "submit".
6. Click "get output".
7. Repeat for groups of up to 1000 regions until you've queried them all.
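
Since the Table Browser takes at most 1000 regions per submission, it may help to split the full region list into batches before step 4. The following is only a minimal sketch in Python, assuming the regions are in a plain BED file with one region per line; the function names and the "batch" file prefix are made up for illustration and are not part of any UCSC tool.

```python
from typing import Iterable, Iterator, List


def read_bed_lines(lines: Iterable[str]) -> List[str]:
    """Keep only region lines, skipping blanks, comments, and track/browser headers."""
    regions = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        regions.append(stripped)
    return regions


def chunk_regions(regions: List[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` regions."""
    for start in range(0, len(regions), size):
        yield regions[start:start + size]


def write_batches(bed_path: str, prefix: str = "batch", size: int = 1000) -> List[str]:
    """Write each batch to prefix_001.bed, prefix_002.bed, ... and return the names.

    `prefix` is a hypothetical naming choice for this sketch.
    """
    with open(bed_path) as handle:
        regions = read_bed_lines(handle)
    names = []
    for index, batch in enumerate(chunk_regions(regions, size), start=1):
        name = f"{prefix}_{index:03d}.bed"
        with open(name, "w") as out:
            out.write("\n".join(batch) + "\n")
        names.append(name)
    return names
```

Calling write_batches("regions.bed") would leave one file per batch, each of which can be pasted into the "define regions" box in turn.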

You can also use several of our command line tools to get this
information. All of them are available under the directory for your
system here: http://hgdownload.soe.ucsc.edu/admin/exe/.

The first tool that you can use is "hgWiggle". This tool does require
that you set up access to our public MySQL server using an "hg.conf"
file described here: http://genome.ucsc.edu/goldenPath/help/mysql.html.
Run hgWiggle on the command line without any arguments to see the usage
message. You can use the "-bedFile" option to define a list of regions
that you want output for.
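
As a rough illustration of the hgWiggle route, the sketch below builds and runs such a command from Python. It assumes hgWiggle is on your PATH and that your hg.conf is already set up. The "-bedFile" option and the table name phyloP100wayAll come from the discussion above; the "-db" option and the overall argument order are my recollection of the usage message, so please check them against what hgWiggle prints when run without arguments. The file names are hypothetical.

```python
import subprocess
from typing import List


def hgwiggle_command(bed_file: str, db: str = "hg19",
                     track: str = "phyloP100wayAll") -> List[str]:
    """Build the hgWiggle argument list for a BED file of regions.

    Assumption: "-db=<database>" selects the assembly; verify this against
    the usage message of your hgWiggle build.
    """
    return ["hgWiggle", f"-db={db}", f"-bedFile={bed_file}", track]


def run_hgwiggle(bed_file: str, out_path: str, db: str = "hg19",
                 track: str = "phyloP100wayAll") -> None:
    """Run hgWiggle and save whatever it prints to `out_path`."""
    with open(out_path, "w") as out:
        subprocess.run(hgwiggle_command(bed_file, db, track),
                       stdout=out, check=True)
```

For example, run_hgwiggle("regions.bed", "scores.wig") would write the scores for every region in regions.bed to scores.wig, with no 1000-region limit to work around.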

Another option would be to download a copy of the bigWig file,
http://genome.ucsc.edu/goldenPath/help/bigWig.html, that contains the
base-wise conservation data and use our utility "bigWigToWig" to get the
scores for those regions. You can find the bigWig files for this type of
data on our download server,
http://hgdownload.soe.ucsc.edu/downloads.html, in the appropriate
section for each species under the link that begins with "Basewise
conservation scores (phyloP)". You can then take a BED file of your
regions and use a loop to plug each region's values into bigWigToWig's
-chrom, -start, and -end options, appending the output for each region
to a new wiggle file.
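
The loop described above could look something like the following Python sketch. It assumes bigWigToWig is on your PATH and is invoked as "bigWigToWig in.bigWig out.wig" with the -chrom, -start, and -end options. The function names and any file names you pass in are hypothetical, and each region is written to a temporary file before being appended to the combined output.

```python
import os
import subprocess
import tempfile
from typing import List, Tuple

Region = Tuple[str, int, int]


def parse_bed_line(line: str) -> Region:
    """Return (chrom, start, end) from a BED 3 or BED 4 line."""
    fields = line.split()
    return fields[0], int(fields[1]), int(fields[2])


def bigwigtowig_command(bigwig: str, region: Region, out_wig: str) -> List[str]:
    """Build the bigWigToWig argument list for a single region."""
    chrom, start, end = region
    return ["bigWigToWig", f"-chrom={chrom}", f"-start={start}",
            f"-end={end}", bigwig, out_wig]


def extract_regions(bigwig: str, bed_path: str, combined_wig: str) -> None:
    """Run bigWigToWig once per BED region and append each result to one file."""
    with open(bed_path) as bed, open(combined_wig, "w") as combined:
        for line in bed:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue  # skip blanks, comments, and header lines
            region = parse_bed_line(line)
            fd, tmp_path = tempfile.mkstemp(suffix=".wig")
            os.close(fd)
            try:
                subprocess.run(bigwigtowig_command(bigwig, region, tmp_path),
                               check=True)
                with open(tmp_path) as part:
                    combined.write(part.read())
            finally:
                os.remove(tmp_path)
```

With a downloaded phyloP bigWig saved locally, extract_regions(path_to_bigwig, "regions.bed", "scores.wig") would collect the scores for all regions in one wiggle file. A few thousand short regions means a few thousand quick calls, which should be manageable against a local copy of the file.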

I hope this is helpful. If you have any further questions, please reply
to gen...@soe.ucsc.edu. All messages sent to that address are archived
on a publicly-accessible Google Groups forum. If your question includes
sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group