Data integrator: How to extract feature values from more than 5 data sources

Mohan C M

Sep 26, 2016, 12:43:25 PM
to gen...@soe.ucsc.edu
Hi,
I want to extract the methylation status of all the CpG sites located around an exon of Estrogen receptor alpha (GRCh37/hg19, chr6:152,124,219-152,168,456) from Whole Genome Bisulfite Sequencing (WGBS) data. The data source I will be using is the track hub "Roadmap Epigenomics Data Complete Collection at Wash U VizHub", whose "DNA methylation" track contains WGBS data for many tissue samples. When I tried the "Data Integrator" tool from the Genome Browser, I could extract the values from only five samples at a time.
Is it possible to get the feature values for all the samples in a single shot?
If not, is there an alternative to extracting from 5 data sources at a time?

Best regards
Mohan

Christopher Lee

Oct 4, 2016, 11:44:02 AM
to Mohan C M, UCSC Genome Browser Discussion List

Hi Mohan,

Thank you for your question about using the Data Integrator to extract information from more than 5 data sources at a time. Unfortunately, the Data Integrator does not currently support this, but if you are up for some scripting it may be possible to get the same data another way. This will require a Unix-like command line and the UCSC utility bigWigToBedGraph. You can download the appropriate bigWigToBedGraph for your OS here:
http://hgdownload.soe.ucsc.edu/admin/exe/

The first step is to get a handle on the data we will be examining. Save the Roadmap trackDb.txt as a local file:
curl -o trackDb_dli_edacc9_3.txt http://vizhub.wustl.edu/VizHub/hg19/trackDb_dli_edacc9_3.txt

Now examine the relevant lines of the file:
grep -E 'shortLabel |bigDataUrl ' trackDb_dli_edacc9_3.txt | less

Once inside of less, search for DNA Methylation (the name of the track of interest). You should see lines like the following:

shortLabel DNA Methylation
    shortLabel MeDIP_Coverage
        bigDataUrl GSM941726.bigWig
        shortLabel PFK MeDIP 02 92
        bigDataUrl GSM941727.bigWig
        shortLabel PFM MeDIP 02 93

Each bigDataUrl line is the name of a file located at http://vizhub.wustl.edu/VizHub/hg19/. For example, the first bigDataUrl is located at:
http://vizhub.wustl.edu/VizHub/hg19/GSM941726.bigWig
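
As a rough sketch of this URL-building step, something like the following could work. It assumes every relative bigDataUrl value resolves against the hub directory shown above, and the function name make_urls is just a placeholder. Note that it prints a URL for every bigDataUrl line it is given, so you would still need to restrict the input to the lines under DNA Methylation.

```shell
#!/bin/sh
# Base URL taken from the example above; relative bigDataUrl values
# are assumed to resolve against it.
BASE_URL="http://vizhub.wustl.edu/VizHub/hg19"

# make_urls (hypothetical name): read trackDb text on stdin and print
# one full URL per bigDataUrl line. Values that already look absolute
# (contain "://") are printed unchanged.
make_urls() {
    awk -v base="$BASE_URL" '
        $1 == "bigDataUrl" {
            if ($2 ~ /:\/\//) print $2
            else print base "/" $2
        }'
}

# Example: make_urls < trackDb_dli_edacc9_3.txt > urls.txt
```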

You can then extract data from this file with the utility bigWigToBedGraph:
bigWigToBedGraph -chrom=chr6 -start=152124218 -end=152168456 http://vizhub.wustl.edu/VizHub/hg19/GSM941726.bigWig GSM941726_ESR1.bedGraph

This command creates a bedGraph file (columns: chrom, start, end, score) with contents like the following:

chr6    152124218       152124220       4
chr6    152124220       152124240       1
chr6    152124240       152124440       0
chr6    152124440       152124480       2
chr6    152124480       152124520       3
...
...
...
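
How to summarize a file like this depends on your analysis. As one possible sketch, the function below reports the number of covered bases and the length-weighted mean of the score column. The weighting choice and the name summarize_bedgraph are assumptions made for illustration; for per-CpG values you may prefer to keep the rows as they are.

```shell
#!/bin/sh
# summarize_bedgraph (hypothetical name): read bedGraph rows from the
# given file(s), or from stdin if none are given, and print
# "<covered bases><TAB><length-weighted mean score>".
summarize_bedgraph() {
    awk '
        # Only use rows that look like data: 4+ fields, numeric start.
        NF >= 4 && $2 ~ /^[0-9]+$/ {
            len = $3 - $2          # interval length (0-based, half-open)
            bases += len
            total += len * $4      # weight each score by its length
        }
        END {
            if (bases > 0) printf "%d\t%.4f\n", bases, total / bases
            else print "0\tNA"
        }' "$@"
}

# Example: summarize_bedgraph GSM941726_ESR1.bedGraph
```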

You can then write a script that does three things:
- builds the full URLs from the relevant bigDataUrl fields
- for each URL and each desired region, uses bigWigToBedGraph to create a corresponding bedGraph file
- for each bedGraph file created in the previous step, aggregates the score field
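
The middle step of that list could be sketched roughly as below. It assumes you already have a text file with one bigWig URL per line (called urls.txt here, a made-up name), and the ESR1 label and function names are placeholders that mirror the output file name used earlier.

```shell
#!/bin/sh
# Region in 0-based, half-open coordinates, as used in the command above.
CHROM=chr6
START=152124218
END=152168456
LABEL=ESR1   # hypothetical suffix for output file names

# bedgraph_name (hypothetical name): derive an output file name from a
# bigWig URL, e.g. .../GSM941726.bigWig -> GSM941726_ESR1.bedGraph.
# Only the exact suffix ".bigWig" is stripped.
bedgraph_name() {
    printf '%s_%s.bedGraph\n' "$(basename "$1" .bigWig)" "$LABEL"
}

# extract_all (hypothetical name): read one URL per line on stdin and
# run bigWigToBedGraph on each, reporting failures without stopping.
extract_all() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip blank lines
        out=$(bedgraph_name "$url")
        bigWigToBedGraph -chrom="$CHROM" -start="$START" -end="$END" \
            "$url" "$out" || echo "failed: $url" >&2
    done
}

# Example: extract_all < urls.txt
```

Running this for several regions would mean repeating the call with different CHROM, START, END and LABEL values.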

Please keep in mind that coordinates in UCSC tools and almost all UCSC file formats are 0-based, half-open: the start is 0-based, and the end is 0-based but points to the base just after the end of the region. So the 1-based, fully-closed position chr6:152,124,219-152,168,456 becomes -chrom=chr6 -start=152124218 -end=152168456 for UCSC command line tools (subtract 1 from the start and leave the end unchanged).
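
If you want to avoid doing that conversion by hand, a small helper along these lines might do. The name region_to_flags is made up for this sketch, and it assumes the input is a browser-style position of the form chrom:start-end, with or without commas.

```shell
#!/bin/sh
# region_to_flags (hypothetical name): convert a 1-based, fully-closed
# browser position into the 0-based, half-open flags shown above.
# Note: plain sh has no local variables, so pos/chrom/range/start/end
# are set in the calling shell as a side effect.
region_to_flags() {
    pos=$(printf '%s' "$1" | tr -d ',')   # drop thousands separators
    chrom=${pos%%:*}                      # text before the colon
    range=${pos#*:}                       # text after the colon
    start=${range%-*}                     # 1-based start
    end=${range#*-}                       # 1-based, inclusive end
    # Subtract 1 from the start; the end stays the same.
    printf '%s\n' "-chrom=$chrom -start=$((start - 1)) -end=$end"
}

# Example: region_to_flags 'chr6:152,124,219-152,168,456'
```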

The advantage of creating this script is that you can run it for any region of interest and examine data from anywhere in the Roadmap hub.

Please let us know if you have any further questions!

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genomics Institute

