question about format of one of your ENCODE files


Eric Foss

Nov 6, 2014, 1:46:44 PM
to gen...@soe.ucsc.edu
Dear UCSC Genome Browser, 

I would like to download transcription factor binding site data from ENCODE. I found promising files here, though the description of what the files contained was extremely limited: 


I downloaded this file: 
wgEncodeRegTfbsClusteredV3.bed.gz
It is described as follows: wgEncodeRegTfbsClusteredV3.bed.gz - TFBS clusters (V3) from ENCODE data uniformly processed by the ENCODE Analysis Working Group (BED 5+2 'factorSource' format. Has 2 list fields for cell ids and scores. See track description at link above for details)

Clicking on the link mentioned in the last sentence of the description didn't help in understanding the file format.

From the .bed file name, I assumed that this was a file in bed format:


Here are the first few lines:

chr1 10073 10329 ZBTB33 354 2 204,246 354,138
chr1 10149 10413 CEBPB 201 1 343 201
chr1 16110 16390 CTCF 227 7 213,612,621,627,628,631,662 110,139,171,209,227,200,170
chr1 29198 29688 TAF1 184 1 157 184
chr1 29275 29591 GABPA 198 1 180 198
chr1 89795 90051 USF1 185 2 68,202 146,185
chr1 91156 91580 CTCF 223 9 9,640,645,657,663,669,672,675,685 183,223,138,115,115,145,141,144,220
chr1 104859 105089 CTCF 106 2 48,49 87,106
chr1 138850 139274 CTCF 166 3 24,213,679 166,125,127
chr1 235541 235877 SP1 120 1 196 120
chr1 235548 235792 EGR1 154 1 216 154
chr1 235734 235849 FOXA1 427 2 177,178 427,370

The first 5 columns match the BED format, but the 6th is supposed to be strand and clearly isn't, and I don't know what the 7th and 8th columns are. Searching your web site, wiki, and mailing list didn't help, nor could I find information about this on the ENCODE project site. Can you please let me know what this file format is?

Thank you.

Eric

Brian Lee

Nov 6, 2014, 3:13:48 PM
to Eric Foss, gen...@soe.ucsc.edu

Dear Eric,

Thank you for using the UCSC Genome Browser and for your question about the wgEncodeRegTfbsClustered track and obtaining transcription factor binding site data from ENCODE.

If you click the referenced track description, http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeRegTfbsClusteredV3, under the "Release Notes" section you will find a paragraph about the new "factorSource" track table format:

For the V3/V4 releases, a new track table format, 'factorSource' was used to represent the primary clusters table and downloads file, wgEncodeRegTfbsClusteredV3. This format consists of standard BED5 fields (see File Formats) followed by an experiment count field (expCount) and finally two fields containing comma-separated lists. The first list field (expNums) contains numeric identifiers for experiments, keyed to the wgEncodeRegTfbsClusteredInputsV3 table, which includes such information as the experiment's underlying Uniform TFBS table name, factor targeted, antibody used, cell type, treatment (if any), and laboratory source. The second list field (expScores) contains the scores for the corresponding experiments.

Also from the Track Description page you can click the "View table: schema" button and see an example row from wgEncodeRegTfbsClusteredV3:

chr1    10073   10329   ZBTB33  354     2       204,246 354,138

The 2 here for expCount indicates that the ZBTB33 cluster at chr1:10074-10329 originates from two experiments. (BED start coordinates are 0-based, so the start 10073 in the file appears as 10074 in the browser.) Those experiments are identified by the expNums 204 and 246, which are keyed to wgEncodeRegTfbsClusteredInputsV3. The scores for those two experiments are 354 and 138. (Note that the Track Description states, "The cluster score is the highest score for any peak contributing to the cluster," which explains why field five has 354.)
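
As a minimal sketch of how the eight fields fit together, the example row could be parsed in Python along these lines. The names FactorSourceRow and parse_factor_source are made up for illustration and are not part of any UCSC tool, and the code assumes whitespace-separated fields as in the downloads file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactorSourceRow:
    """One line of a BED 5+3 'factorSource' file (illustrative name)."""
    chrom: str
    chrom_start: int        # 0-based start, as in BED
    chrom_end: int
    name: str               # factor name, e.g. ZBTB33
    score: int              # cluster score (field five)
    exp_count: int          # number of contributing experiments
    exp_nums: List[int]     # experiment identifiers
    exp_scores: List[float] # one score per experiment, same order as exp_nums


def parse_factor_source(line: str) -> FactorSourceRow:
    """Parse one whitespace-separated factorSource line."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    chrom, start, end, name, score, exp_count, nums, scores = fields
    # Skip empty items so a trailing comma in a list field is tolerated.
    exp_nums = [int(x) for x in nums.split(",") if x]
    exp_scores = [float(x) for x in scores.split(",") if x]
    row = FactorSourceRow(chrom, int(start), int(end), name,
                          int(score), int(exp_count), exp_nums, exp_scores)
    # expCount should equal the length of both list fields.
    if not (row.exp_count == len(exp_nums) == len(exp_scores)):
        raise ValueError("expCount does not match the list field lengths")
    return row


row = parse_factor_source("chr1 10073 10329 ZBTB33 354 2 204,246 354,138")
print(row.name, row.exp_nums, row.exp_scores)
# The cluster score should be the highest contributing experiment score.
print(row.score == max(row.exp_scores))
```

The length check simply mirrors the description above: expCount says how many entries to expect in expNums and expScores.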

In essence, if you are not interested in this information, you can disregard these final columns; they are metadata describing which cell types and relative scores support the evidence of factor binding at the given coordinates. For example, if you navigate to chr1:10074-10329 with this track displayed at "full", you will see the ZBTB33 item, which you can click to see the following details:

#    signal    abr    cellType    factor    antibody    treatment    lab    more info
1    354.00    L    HepG2    ZBTB33    ZBTB33    None    HudsonAlpha     metadata
2    138.00    K    K562    ZBTB33    ZBTB33    None    HudsonAlpha     metadata

If you click the "metadata" links, you will see that the data described by these final columns originate from wgEncodeAwgTfbsHaibHepg2Zbtb33Pcr1xUniPk and wgEncodeAwgTfbsHaibK562Zbtb33Pcr1xUniPk (through the wgEncodeRegTfbsClusteredInputsV3 file mentioned above).
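
To illustrate how the expNums and expScores lists line up with the experiment metadata, here is a rough Python sketch. The INPUTS dictionary is a hand-built stand-in that contains only the two experiments from the details shown above (matched to 204 and 246 by their order and scores). It is an assumption for illustration and does not reflect the actual column layout of wgEncodeRegTfbsClusteredInputsV3, which you would need to load and key yourself. The function name describe_experiments is likewise hypothetical.

```python
# Hypothetical lookup: experiment number -> metadata.
# Hand-built from the two detail rows above, NOT read from the real inputs file.
INPUTS = {
    204: {"cellType": "HepG2", "lab": "HudsonAlpha",
          "table": "wgEncodeAwgTfbsHaibHepg2Zbtb33Pcr1xUniPk"},
    246: {"cellType": "K562", "lab": "HudsonAlpha",
          "table": "wgEncodeAwgTfbsHaibK562Zbtb33Pcr1xUniPk"},
}


def describe_experiments(line, inputs):
    """Return (score, cellType, factor, table) for each experiment in a row."""
    fields = line.split()
    factor = fields[3]
    exp_nums = [int(x) for x in fields[6].split(",") if x]
    exp_scores = [float(x) for x in fields[7].split(",") if x]
    rows = []
    # The two lists are parallel: the i-th score belongs to the i-th experiment.
    for num, score in zip(exp_nums, exp_scores):
        meta = inputs.get(num, {})
        rows.append((score,
                     meta.get("cellType", "unknown"),
                     factor,
                     meta.get("table", "unknown")))
    return rows


for entry in describe_experiments(
        "chr1 10073 10329 ZBTB33 354 2 204,246 354,138", INPUTS):
    print(entry)
```

Experiments missing from the lookup are reported as "unknown" rather than raising an error, which seemed the friendlier choice for a partial table like this one.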

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee

