question on wgEncodeRegTfbsClusteredV3


Dong, Xianjun

Jan 30, 2018, 12:46:08 PM
to gen...@soe.ucsc.edu
Hi, 

I’m at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRegTfbsClustered/ to extract the TF binding peaks for a specific cell type. According to the description page, 

For the V3/V4 releases, a new track table format, 'factorSource' was used to represent the primary clusters table and downloads file, wgEncodeRegTfbsClusteredV3. This format consists of standard BED5 fields (see File Formats) followed by an experiment count field (expCount) and finally two fields containing comma-separated lists. The first list field (expNums) contains numeric identifiers for experiments, keyed to the wgEncodeRegTfbsClusteredInputsV3 table, which includes such information as the experiment's underlying Uniform TFBS table name, factor targeted, antibody used, cell type, treatment (if any), and laboratory source. The second list field (expScores) contains the scores for the corresponding experiments. For convenience, the file downloads directory for this track also contains a BED file, wgEncodeRegTfbsClusteredWithCellsV3, that lists each cluster with the cluster score followed by a comma-separated list of cell types.

But when I used the 7th column (expNums) of wgEncodeRegTfbsClusteredV3.bed as a key to locate the corresponding lines in wgEncodeRegTfbsClusteredInputsV3.tab, I don’t see the same TF. For example, the 1st line of wgEncodeRegTfbsClusteredV3 below is for TF ZBTB33, and its 7th column is 204,246. But the 204th and 246th lines of wgEncodeRegTfbsClusteredInputsV3 are experiments for TF YY1, not ZBTB33.

Could you advise how I can extract all TF ChIP-seq peaks for a specific cell line, e.g. SK-N-SH?

[xd010@eris1n2 TFBS]$ zcat wgEncodeRegTfbsClusteredV3.bed.gz | head
chr1 10073 10329 ZBTB33 354 2 204,246 354,138
chr1 10149 10413 CEBPB 201 1 343 201
chr1 16110 16390 CTCF 227 7 213,612,621,627,628,631,662 110,139,171,209,227,200,170
chr1 29198 29688 TAF1 184 1 157 184
chr1 29275 29591 GABPA 198 1 180 198
chr1 89795 90051 USF1 185 2 68,202 146,185
chr1 91156 91580 CTCF 223 9 9,640,645,657,663,669,672,675,685 183,223,138,115,115,145,141,144,220
chr1 104859 105089 CTCF 106 2 48,49 87,106
chr1 138850 139274 CTCF 166 3 24,213,679 166,125,127
chr1 235541 235877 SP1 120 1 196 120

[xd010@eris1n2 TFBS]$ zcat wgEncodeRegTfbsClusteredInputsV3.tab.gz | head
wgEncodeAwgTfbsBroadDnd41CtcfUniPk Dnd41+Broad+CTCF CTCF CTCF Dnd41 None Broad
wgEncodeAwgTfbsBroadDnd41Ezh239875UniPk Dnd41+Broad+EZH2_(39875) EZH2 EZH2_(39875) Dnd41 None Broad
wgEncodeAwgTfbsBroadGm12878CtcfUniPk GM12878+Broad+CTCF CTCF CTCF GM12878 None Broad
wgEncodeAwgTfbsBroadGm12878Ezh239875UniPk GM12878+Broad+EZH2_(39875) EZH2 EZH2_(39875) GM12878 None Broad
wgEncodeAwgTfbsBroadH1hescChd1a301218aUniPk H1-hESC+Broad+CHD1_(A301-218A) CHD1 CHD1_(A301-218A) H1-hESC None Broad
wgEncodeAwgTfbsBroadH1hescCtcfUniPk H1-hESC+Broad+CTCF CTCF CTCF H1-hESC None Broad
wgEncodeAwgTfbsBroadH1hescEzh239875UniPk H1-hESC+Broad+EZH2_(39875) EZH2 EZH2_(39875) H1-hESC None Broad
wgEncodeAwgTfbsBroadH1hescJarid1aab26049UniPk H1-hESC+Broad+JARID1A_(ab26049) KDM5A JARID1A_(ab26049) H1-hESC None Broad
wgEncodeAwgTfbsBroadH1hescRbbp5a300109aUniPk H1-hESC+Broad+RBBP5_(A300-109A) RBBP5 RBBP5_(A300-109A) H1-hESC None Broad
wgEncodeAwgTfbsBroadHelas3CtcfUniPk HeLa-S3+Broad+CTCF CTCF CTCF HeLa-S3 None Broad

[xd010@eris1n2 TFBS]$ zcat wgEncodeRegTfbsClusteredInputsV3.tab.gz | sed -n '204p;246p;247q'
wgEncodeAwgTfbsHaibHepg2Yy1sc281V0416101UniPk HepG2+HudsonAlpha+YY1_(SC-281) YY1 YY1_(SC-281) HepG2 None HudsonAlpha
wgEncodeAwgTfbsHaibK562Yy1V0416102UniPk K562+HudsonAlpha+YY1 YY1 YY1 K562 None HudsonAlpha

Thanks,

Xianjun Dong, PhD
----------------------------------------------------
Director of Computational Neuroscience
Neurogenomics Laboratory and Parkinson Personalized Medicine 
Brigham and Women's Hospital

Instructor in Neurology, Harvard Medical School

Building for Transformative Medicine, 
60 Fenwood Road, 9002EE
Boston, MA 02115



Brian Lee

Jan 30, 2018, 6:24:23 PM
to Dong, Xianjun, gen...@soe.ucsc.edu

Dear Xianjun,

Thank you for using the UCSC Genome Browser and your question about the wgEncodeRegTfbsClusteredInputsV3.tab file.

Here is a session link to visualize the one referenced ZBTB33 item (at chr1 10073 10329): http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=brianlee&hgS_otherUserSessionName=hg19.ZBTB33

If you click into the item you will see two boxes that reference the inputs; these are the two experiments you saw listed in the 7th column as 204,246.

It turns out the line numbers are off by 1 because the experiments start at 0. Experiment 0 is on line 1 of wgEncodeRegTfbsClusteredInputsV3, so experiment 204 is on line 205 and experiment 246 is on line 247:

$ cat wgEncodeRegTfbsClusteredInputsV3.tab | sed -n '205p;247p'
wgEncodeAwgTfbsHaibHepg2Zbtb33Pcr1xUniPk HepG2+HudsonAlpha+ZBTB33 ZBTB33 ZBTB33 HepG2 None HudsonAlpha
wgEncodeAwgTfbsHaibK562Zbtb33Pcr1xUniPk K562+HudsonAlpha+ZBTB33 ZBTB33 ZBTB33 K562 None HudsonAlpha
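The same 0-based numbering can be applied to the whole clustered file at once. The following is only a rough sketch, not an official UCSC recipe: the function name peaks_for_cell is made up for illustration, and it assumes both files have been decompressed, are tab-separated, and have the cell type in column 5 of the inputs table and the expNums list in column 7 of the clusters file, as in the listings above.

```shell
# peaks_for_cell CELL INPUTS_TAB CLUSTERS_BED   (hypothetical helper)
# Prints every cluster row that is supported by at least one experiment
# performed in CELL. Assumes uncompressed, tab-separated input files.
peaks_for_cell() {
  awk -F '\t' -v cell="$1" '
    # First file (inputs table): experiment N is on line N+1,
    # so the 0-based experiment number is FNR - 1.
    NR == FNR { if ($5 == cell) keep[FNR - 1] = 1; next }
    # Second file (clusters): column 7 holds the comma-separated expNums.
    {
      n = split($7, ids, ",")
      for (i = 1; i <= n; i++)
        if (ids[i] in keep) { print; break }
    }
  ' "$2" "$3"
}

# Example call, after decompressing both downloads:
#   peaks_for_cell SK-N-SH wgEncodeRegTfbsClusteredInputsV3.tab wgEncodeRegTfbsClusteredV3.bed
```

Note that each row printed this way still describes the whole cluster (all contributing experiments and the cluster-level score), not the individual peak called in that cell line.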

You have a couple of approaches for extracting all TF ChIP-seq peaks for a specific cell line, e.g. SK-N-SH. Likely the best choice is to go to the source data rather than trying to extract it from this clustered summary track. On the Track Description page, where you found the note about the Inputs file, the third paragraph from the top has a link about data being "available from the ENCODE Uniform TFBS track."

Following that link to the collection of ENCODE Uniform TFBS tracks, you will see a very large matrix with one row per Factor and one column per Cell Line. The first thing to do is click the [-] in the very top left corner to unselect all tracks for all factors and cell lines, and then click the [+] for the SK-N-SH column to select all those tracks. Here is a session with that selection made: http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=brianlee&hgS_otherUserSessionName=hg19.SKNSH

The files for these ENCODE Uniform TFBS track data can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/

That directory has a metadata file called files.txt; the lines containing "cell=SK-N-SH" give the related file names and md5sums.
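As a rough sketch of that lookup: the function name files_for_cell below is hypothetical, and the layout it expects (a file name, a tab, then "key=value; key=value; ..." pairs that include cell= and md5sum=) is an assumption about files.txt that should be checked against the actual file before relying on it.

```shell
# files_for_cell CELL < files.txt   (hypothetical helper)
# Prints "fileName<TAB>md5sum" for every entry whose metadata has cell=CELL.
# Assumed layout per line: fileName<TAB>key=value; key=value; ...
files_for_cell() {
  awk -F '\t' -v cell="$1" '
    # Pad the metadata with separators so cell=CELL only matches as a
    # complete key=value pair, not as the prefix of a longer cell name.
    index("; " $2 ";", "; cell=" cell ";") {
      md5 = ""
      if (match($2, /md5sum=[0-9a-f]+/))
        md5 = substr($2, RSTART + 7, RLENGTH - 7)   # skip the "md5sum=" prefix
      print $1 "\t" md5
    }
  '
}

# Example call:
#   files_for_cell SK-N-SH < files.txt
```

The md5sum column is left empty for any line where no md5sum= entry is found.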

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UC Santa Cruz Genomics Institute

Training videos & resources: http://genome.ucsc.edu/training/index.html
Want to share the Browser with colleagues?
Host a workshop: http://bit.ly/ucscTraining



sterding

Jan 31, 2018, 11:32:38 AM
to Brian Lee, Xianjun Dong, gen...@soe.ucsc.edu
Dear Brian,

Thanks for the prompt help. It’s good to know that the experiment numbers (expNums) start at 0.

Two follow-up questions:

1. How are the cell types (6th column) in wgEncodeRegTfbsClusteredWithCellsV3.bed coded? E.g. what’s the numeric ID for SK-N-SH?

2. Another interesting thing: I can only find 9 different TFs for SK-N-SH in wgEncodeRegTfbsClusteredInputsV3 (e.g. "zgrep SK-N-SH wgEncodeRegTfbsClusteredInputsV3.tab.gz | cut -f3 | sort -u | wc -l"). According to the description page, V3 was released in August 2013, so it should already include the ENCODE September 2012 freeze. On the encodeproject.org website, I can see that up to the ENCODE September 2012 freeze there were at least 36 experiments. See "ENCODE2: 36; ENCODE3: 6" in the left sidebar of the following page:
https://www.encodeproject.org/search/?type=Experiment&searchTerm=sk-n-sh&target.investigated_as=transcription+factor&limit=all
So my question is: where are the other TF ChIP-seq experiments for SK-N-SH in wgEncodeRegTfbsClusteredWithCellsV3 (or in the ENCODE Uniform TFBS tracks you pointed to)?
I actually found that the UCSC ENCODE Experiment Matrix (https://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html) lists 34 TF ChIP-seq experiments for SK-N-SH. If you click the number “34”, you will see the list of TF experiments for SK-N-SH:
https://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=654776117_aUCyoB8pS8LrfeINho3Z8Azw4Ntd&hgt_=1517365641&db=hg19&tsCurTab=advancedTab&hgt_tsDelRow=&hgt_tsAddRow=&hgt_tsPage=&tsSimple=&tsName=&tsDescr=&tsGroup=Any&tsType=Any&hgt_mdbVar1=dataType&hgt_mdbVal1=ChipSeq&hgt_mdbVar2=cell&hgt_mdbVal2=SK-N-SH&hgt_mdbVar3=view&hgt_mdbVal3=Peaks&hgt_tSearch=search
Some of the files can be downloaded here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/
I am wondering why they are not included in wgEncodeRegTfbsClusteredWithCellsV3 or in the ENCODE Uniform TFBS tracks.

Thanks,
-Xianjun