labels for bb file fields downloaded from Encode

Tejas Shah

Feb 24, 2014, 11:06:44 AM
to gen...@soe.ucsc.edu
Hi,

I downloaded the data for Uniform histone peaks from the ENCODE downloads page (hg19 analysis hub).


The link pointed to this directory:

http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/peaks/jan2011/histone_macs/optimal/hub/

I downloaded the bigBed files from here, and then converted a couple of them to BED format. After converting them, I realised the columns weren't described anywhere. I looked at the data format page (http://encodeproject.org/ENCODE/fileFormats.html), but it didn't seem to cover the actual formats in these files.

It's a similar case with the other uniform peak annotation .bb files.

Where would I find the column labels for these files?

cheers
Tejas

Brian Lee

Feb 24, 2014, 12:49:45 PM
to Tejas Shah, gen...@soe.ucsc.edu
Dear Tejas,

Thank you for using the UCSC Genome Browser and for your question about the file format of the Uniform Histone peaks in the AWG Hub.

Please see the ENCODE resources and FAQ page that includes helpful links and previously answered questions about formats: http://genome.ucsc.edu/ENCODE/FAQ/index.html

You may want to click the "UCSC Genome Browser ENCODE-specific File Formats" link, http://encodeproject.org/FAQ/FAQformat.html#ENCODE, on the file formats page you referenced to learn about the narrowPeak format.

At the top of the downloads page you referenced, you will see a link that allows you to visualize the AWG hub in the UCSC Browser: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubUrl=http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/hub.txt

In the AWG directory you mentioned, there is a uniformHistone.html page, http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/peaks/jan2011/histone_macs/optimal/hub/uniformHistone.html, which provides important information about this data when it is viewed in the browser. These .html files can also serve as documentation for the files you see as you navigate the AWG downloads directory. (Please note that UCSC does not maintain the AWG files; they are maintained by external sources, so UCSC is not responsible for their content.)

If you go up two directory levels from the AWG page you referenced, you will find a README.txt, http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/peaks/jan2011/histone_macs/README.txt, that includes a note about the narrowPeak format: "NOTE: The Q-value column in the narrowPeak files is actually the global IDR score for each peak."

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group

Tejas Shah

Feb 25, 2014, 12:02:53 PM
to gen...@soe.ucsc.edu

Hi Brian,

Thanks for the info. I've looked at the FAQ and file formats pages, and the visualisation page too, but I couldn't match up the data in the directory I was looking at (http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/peaks/jan2011/histone_macs/optimal/hub/) to one of the ENCODE-specific BED formats. Looking at the narrowPeak format (http://genome.ucsc.edu/FAQ/FAQformat.html#format12), columns 4 and 5 are the name and score (0-1000). However, the name column in these files looks like another numeric field, and the score goes above 1000, e.g.:

chr1    1078882 1080351 16669   15179   .       15.54   4.808136        5.850519        957
chr1    1092819 1094549 10820   12098   .       13.07   5.874796        6.545509        1060
chr1    1141299 1142453 35929   32674   .       9.53    1.111986        2.089407        746
chr1    1147995 1149793 22666   32999   .       11.32   1.076765        2.052549        933

What could be the reason for this?

cheers
Tejas

Brian Lee

Feb 25, 2014, 12:12:41 PM
to Tejas Shah, gen...@soe.ucsc.edu
Dear Tejas,

Thank you for using the UCSC Genome Browser and taking the time to provide an example regarding the interpretation of the AWG narrowPeak files.

The ENCODE resources and FAQ page includes previously answered questions about formats, such as this one that may be of help: http://genome.ucsc.edu/ENCODE/FAQ/index.html#release11

Provided example:
chr1 1078882 1080351 16669 15179 . 15.54 4.808136 5.850519 957

You are correct in your interpretation of the fields. It is just unusual and confusing that the name is a number (16669) in this example, and that scores can go above 1000 (15179 here) even though they are cut off at 1000 in the display.
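As a rough illustration of that field layout, here is a minimal Python sketch that splits the example line into the ten narrowPeak columns. The `NarrowPeak` class and `parse_narrowpeak` function are hypothetical names chosen for this example, not part of any UCSC tool, and the sketch assumes whitespace-separated fields as in the lines pasted above.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    """One narrowPeak record (hypothetical helper type for illustration)."""
    chrom: str
    chrom_start: int     # 0-based start
    chrom_end: int       # end, not included in the feature
    name: str            # kept as text; in these files it happens to be a number
    score: int           # may exceed 1000 in these files
    strand: str          # "." when no strand is assigned
    signal_value: float  # overall enrichment for the region
    p_value: float       # -log10 p-value, -1 if not assigned
    q_value: float       # per the AWG README, actually the global IDR score here
    peak: int            # offset of the point source from chrom_start, -1 if none


def parse_narrowpeak(line: str) -> NarrowPeak:
    """Parse one whitespace-separated narrowPeak line."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return NarrowPeak(
        fields[0], int(fields[1]), int(fields[2]), fields[3], int(fields[4]),
        fields[5], float(fields[6]), float(fields[7]), float(fields[8]),
        int(fields[9]),
    )


record = parse_narrowpeak(
    "chr1 1078882 1080351 16669 15179 . 15.54 4.808136 5.850519 957"
)
print(record.name, record.score, record.peak)
```

Keeping `name` as a string avoids treating it as a quantity, even when it looks numeric as it does in these files.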

If you navigate to the AWG hub, http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubUrl=http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/hub.txt, go to the position chr1 1078882 1080351, and display only the Histone Modification Track, you can click the first item (which fortuitously happens to be your example) to see the following on the details page:

Item: 16669
Score: 15179
Position: chr1:1078883-1080351
Band: 1p36.33
...
Measurement of overall (usually, average) enrichment for the region: 15.54
Measurement of statistical significance (-log10, -1 if no pValue is assigned): 4.808136
Measurement of statistical significance using false discovery rate (-log10, -1 if no qValue is assigned): 5.850519
Point-source called for this peak; 0-based offset from chromStart (-1 if no point-source called): 957
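Note that the Position shown on the details page starts at 1078883 while the file line starts at 1078882. That is consistent with BED-style files using a 0-based start and the browser showing a 1-based start. A small Python sketch of that conversion, and of turning the point-source offset into a coordinate, follows; the function names are made up for this example.

```python
from typing import Optional


def browser_position(chrom: str, chrom_start: int, chrom_end: int) -> str:
    """Format a 0-based, end-exclusive interval as a 1-based browser position."""
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


def point_source_coordinate(chrom_start: int, peak_offset: int) -> Optional[int]:
    """Return the 0-based coordinate of the point source, or None if not called.

    The last narrowPeak column is an offset from chromStart, with -1 meaning
    that no point source was called.
    """
    if peak_offset < 0:
        return None
    return chrom_start + peak_offset


position = browser_position("chr1", 1078882, 1080351)
summit = point_source_coordinate(1078882, 957)
print(position)  # chr1:1078883-1080351
print(summit)    # 1079839
```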

In this way, the AWG hub can be a very useful means of visualizing the data and finding explanations. Please especially review the track description page by scrolling down below the details. While you are correct that the browser shades scores from light to dark over the range 0-1000, scores can actually go much higher; values exceeding 1000 are simply capped at 1000 for display purposes.
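To make the capping concrete, here is a minimal Python sketch applied to the four scores from the lines posted earlier in this thread. It only illustrates the idea of capping at 1000; it is not the browser's actual shading code, and `DISPLAY_MAX` and `display_score` are names assumed for this example.

```python
DISPLAY_MAX = 1000  # upper end of the 0-1000 display range


def display_score(score: int) -> int:
    """Cap a raw score at the display maximum; lower scores pass through."""
    return min(score, DISPLAY_MAX)


raw_scores = [15179, 12098, 32674, 32999]  # column 5 of the posted lines
capped = [display_score(s) for s in raw_scores]
print(capped)  # [1000, 1000, 1000, 1000]
```

Under this assumption, every peak in the posted example would be drawn at the darkest shade, so the raw values in the file are what distinguish them.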

Tejas Shah

Feb 26, 2014, 11:31:05 AM
to Brian Lee, gen...@soe.ucsc.edu
Hi Brian,

Thanks! Looking at the tracks in the browser was really helpful.

cheers
Tejas