Help downloading .fastq files from ENCODE

Dallas Nygard

Jan 27, 2022, 7:10:12 PM

to 3D Genomics

I am a PhD student at the University of Ottawa researching the functional implications of chromatin spatial arrangement, and I was hoping to ask you a few questions about the Hi-C data your group generated for the publication "Cohesin Loss Eliminates All Loop Domains" (Rao et al. 2017). I would like to use the data you have made available via ENCODE for my research, but I have run into some issues parsing the various file names, ENCODE accessions, replicates, and treatment conditions. What I’m looking for are the raw sequencing files (.fastq) for the untreated, treated, and treated + 180 min withdrawal samples that were used in Figure 2. Going by the supplemental spreadsheet that was included with the paper (attached), I believe the cases (libraries?) I need are as follows:

Rao-2017-HIC001

Rao-2017-HIC002

Rao-2017-HIC003

Rao-2017-HIC004

Rao-2017-HIC005

Rao-2017-HIC006

Rao-2017-HIC007

Rao-2017-HIC008

Rao-2017-HIC009

Rao-2017-HIC010

Rao-2017-HIC011

Rao-2017-HIC012

Rao-2017-HIC013

Rao-2017-HIC014

Rao-2017-HIC044

Rao-2017-HIC045

Rao-2017-HIC046

Rao-2017-HIC047

Looking through the fastq files available on ENCODE, it’s not clear to me which files correspond to which of the cases listed above. I am looking at the files at this link: https://www.encodeproject.org/experiments/ENCSR152HRS/ and, although their names seem to follow the conventions of the supplementary file, all 80 of the provided files are labelled either HIC045 or HIC046. Could you provide some guidance on how to relate the files available for download to the Library IDs given in the supplemental table? Any help with this matter would be much appreciated.

Furthermore, I was wondering how exactly your group went about amalgamating these raw sequencing files. For example, Rao-2017-HIC001 through Rao-2017-HIC007 are spread across two replicates for the same treatment condition, but have all of their reads summed in the “TOTAL” row. Were all seven of these libraries combined for the analysis and creation of Figure 2? Were the replicates kept separate at all, or were all of the reads combined to get the maximum number of reads/contacts in the final analysis?

Thank you for your time in reading this and helping me with my questions. If you have any clarifying questions or comments for me, please do not hesitate to reach out.

3D Genomics

Feb 10, 2022, 11:58:38 AM

to 3D Genomics

Hi Dallas,

You can find all of the fastqs you requested on GEO labeled with the corresponding sample IDs listed in the supplemental spreadsheet: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104333
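
As a rough illustration only (not part of the original analysis pipeline), downloaded file names could be matched against the library IDs listed earlier in this thread along the following lines. The sketch assumes that each file name contains its library ID as "HIC" followed by three digits; the helper `group_by_library` and the example file names are hypothetical, and the actual naming on GEO may differ.

```python
import re
from collections import defaultdict

# Library IDs requested in this thread: HIC001-HIC014 and HIC044-HIC047.
WANTED = {f"HIC{n:03d}" for n in list(range(1, 15)) + list(range(44, 48))}

# Assumption: the library ID appears in the file name as "HIC" + exactly 3 digits.
LIB_PATTERN = re.compile(r"HIC\d{3}(?!\d)")


def group_by_library(filenames):
    """Group file names by library ID, keeping only the requested libraries."""
    groups = defaultdict(list)
    for name in filenames:
        match = LIB_PATTERN.search(name)
        if match and match.group(0) in WANTED:
            groups[match.group(0)].append(name)
    return dict(groups)


# Hypothetical file names, for illustration only.
example_files = [
    "Rao-2017-HIC045_R1.fastq.gz",
    "Rao-2017-HIC045_R2.fastq.gz",
    "Rao-2017-HIC046_R1.fastq.gz",
    "Rao-2017-HIC020_R1.fastq.gz",  # not in the requested list, so it is skipped
]

for library, files in sorted(group_by_library(example_files).items()):
    print(library, files)
```

Grouping by library ID first makes it easy to check that every requested library has at least one file before starting any processing.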

For all of the analyses in the main figures, reads corresponding to biological replicates were combined to create maps with the maximum number of reads possible.

Best,

Suhas

Dallas Nygard

Feb 11, 2022, 3:14:19 AM

to 3D Genomics

Perfect, thank you for looking into this for me! I originally was not finding the fastq files on GEO, but I see where they are now. Thank you additionally for the information about combining the replicates; this is good to know.