Processing HiChip data

dborg...@gmail.com

unread,

Mar 20, 2017, 7:52:46 PM3/20/17

to HiC-Pro

Hello,
I'm trying to process HiChip data and seem to be running into a problem.
I seem to be losing the vast majority of my valid pairs when assigning pairs to the target regions:

smc1a_DMSO_allValidPairs.mergestat
valid_interaction    4266330
valid_interaction_rmdup    4097692
trans_interaction    497615
cis_interaction    3600077
cis_shortRange    1053475
cis_longRange    2546602
valid_interaction_ontarget    135279

As you can see, vast majority of pairs are dropped because they do not fall "on target"
I am using Chip-Seq broad peaks (called using MACS) of the factor which was pulled down, smc1a in this case, as the "CAPTURE TARGET" parameter, is this correct?

I am providing this config file:
#########################################################################
## Paths and Settings - Do not edit !
#########################################################################

TMP_DIR = tmp
LOGS_DIR = logs
BOWTIE2_OUTPUT_DIR = bowtie_results
MAPC_OUTPUT = hic_results
RAW_DIR = rawdata

#######################################################################
## SYSTEM AND SCHEDULER - Start Editing Here !!
#######################################################################
N_CPU = 8
LOGFILE = hicpro.log

JOB_NAME = hic_test
JOB_MEM = 15
JOB_WALLTIME =
JOB_QUEUE =
JOB_MAIL =

#########################################################################
## Data
#########################################################################

PAIR1_EXT = _R1
PAIR2_EXT = _R2

#######################################################################
## Alignment options
#######################################################################

FORMAT = phred33
MIN_MAPQ = 0

BOWTIE2_IDX_PATH = /input_data
BOWTIE2_GLOBAL_OPTIONS = --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder
BOWTIE2_LOCAL_OPTIONS = --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder

#######################################################################
## Annotation files
#######################################################################

REFERENCE_GENOME = hg19_ref
GENOME_SIZE = chrom_hg19.sizes
CAPTURE_TARGET = /hic_in/chip-seq/smc1a_merged_broadPeaks_1kbslop.bed

#######################################################################
## Allele specific analysis
#######################################################################

ALLELE_SPECIFIC_SNP =

#######################################################################
## Digestion Hi-C
#######################################################################

GENOME_FRAGMENT = hg19_MboI.bed
LIGATION_SITE =
MIN_FRAG_SIZE =
MAX_FRAG_SIZE =
MIN_INSERT_SIZE =
MAX_INSERT_SIZE =

#######################################################################
## Hi-C processing
#######################################################################

MIN_CIS_DIST = 1000
GET_ALL_INTERACTION_CLASSES = 1
GET_PROCESS_SAM = 0
RM_SINGLETON = 1
RM_MULTI = 1
RM_DUP = 1

#######################################################################
## Contact Maps
#######################################################################

BIN_SIZE = 1000 10000 40000 150000 1000000
MATRIX_FORMAT = upper

#######################################################################
## Normalization
#######################################################################
MAX_ITER = 100
FILTER_LOW_COUNT_PERC = 0.02
FILTER_HIGH_COUNT_PERC = 0
EPS = 0.1

--Thank you
--Diego

nservant

unread,

Mar 21, 2017, 8:27:11 AM3/21/17

to HiC-Pro

Hi Diego
Thank you very much your interesting post.
Actually, I never really analyzed HiChip data myself. My feeling is that you should not specifiy a target BED.

First, this parameter has been designed for capture-Hi-C data. It means that only cis interactions from which both interactors overlap with the target are kept by default.
So in you case, only valid 3C products with R1 + R2 falling within a peak will be used. If you have an interaction beteen a ChIPseq peak, and another region, it will be filtered out.
If you want to change that. I already designed a parameter to specify if the read pairs / or only one of the two reads, have to interact with the target.
This is the '-p' parameter of the onTarget.py script. You can try to remove this "-p" in the script merge_valid_interactions.sh, line 95, to see if the results are better.
Sorry, there is no simplier way to turn off this option so far.

Second, If you look at the Mumbach et al. paper, their pictures really look like Hi-C data. I don't know if it really make sense to only look at interactions that overlap with the peaks.
In HiC (and I guess in HiChip too), you might expect a enrichment around the restriction sites, rigth ? so maybe not necessarily exactly at the ChIP-seq peak location ?

I do not know if someone else have feedbacks on that.
Anyway, if you think that I could add some options in the configuration file (like removing the -p parameter), feel free to ask. I'll be happy to improve HiC-Pro for HiChip processing.
Best
Nicolas

dborg...@gmail.com

unread,

Mar 21, 2017, 12:23:29 PM3/21/17

to HiC-Pro

Hey Nicolas,

So I think getting HiC-Pro to automatically run on HiChip should actually be relatively simple. I think you should be able to hack it a bit to get it working, namely:

1) As you said, add an option to disable the -p parameter, allowing for fragments which have only one end mapping within a target fragment.

2) Add an input option which takes in a bed file of chip-seq peaks, it then overlaps this file w/ the restriction fragment file, this assigns peaks to each fragment, thereby giving a list of fragments which are expected to show up in one or two ends of the sequencing data. You can also automatically calculate the peak regions by using the self-ligated reads and dangling ends (thereby keeping the analysis internally consistent and automatic), i think this option would be best, and pretty trivially implemented because you already determine all the fragment classes, just have to concatenate the two and run MACS, then overlap the peaks w/ your fragment file.

3) Normalization, this is a bit trickier, ICE normalization protocol for this type of data should not work, you cannot work on the equal coverage assumption, therefore a new normalization strategy should be implemented, i'm not sure which one exactly off the top of my head.

If those three things are done your pipeline should be fully capable to analyze HiChiP data from beginning to end with no real modifications, this would be enormous. If you are able to do this, I can guarantee you'll get a ton more people using it, your pipeline is currently the most accessible in terms of setup and documentation and HiChip is definitely getting way more popular.

--Diego

nservant

unread,

Mar 21, 2017, 2:23:37 PM3/21/17

to HiC-Pro

Hi Diego,

Thank you for your feedbacks.
Please find below my comments

On Tuesday, 21 March 2017 17:23:29 UTC+1, dborg...@gmail.com wrote:

Hey Nicolas,

So I think getting HiC-Pro to automatically run on HiChip should actually be relatively simple. I think you should be able to hack it a bit to get it working, namely:

1) As you said, add an option to disable the -p parameter, allowing for fragments which have only one end mapping within a target fragment.

ok. That's easy

2) Add an input option which takes in a bed file of chip-seq peaks, it then overlaps this file w/ the restriction fragment file, this assigns peaks to each fragment, thereby giving a list of fragments which are expected to show up in one or two ends of the sequencing data. You can also automatically calculate the peak regions by using the self-ligated reads and dangling ends (thereby keeping the analysis internally consistent and automatic), i think this option would be best, and pretty trivially implemented because you already determine all the fragment classes, just have to concatenate the two and run MACS, then overlap the peaks w/ your fragment file.

I see your point. But I think that this is quite easy to do. And I'm not sure that there is a big interest in adding it to the core of HiC-Pro. For instance, you do not need HiC-pro to find the restriction fragments that overlap with your peaks ... BEDTools can do it in a much more efficient way. And I do not want to add any new dependency, like MAC2. But I will keep that point in mind and think about it.

3) Normalization, this is a bit trickier, ICE normalization protocol for this type of data should not work, you cannot work on the equal coverage assumption, therefore a new normalization strategy should be implemented, i'm not sure which one exactly off the top of my head.

ok. if you have a new method, we can think about adding it.

If those three things are done your pipeline should be fully capable to analyze HiChiP data from beginning to end with no real modifications, this would be enormous. If you are able to do this, I can guarantee you'll get a ton more people using it, your pipeline is currently the most accessible in terms of setup and documentation and HiChip is definitely getting way more popular.

Thank you. I'll start by the '-p' option, which is easy. And think about your other idea. It's true that we have many users that run HiC-Pro on HiChip data (including the original paper I think), so it would be important to answer their need.
best

Reply all

Reply to author

Forward