Arima Hi-C - processing Hi-C samples made with multiple restriction enzymes

2,857 views
Skip to first unread message

Vivek

unread,
Apr 18, 2018, 5:34:34 PM4/18/18
to 3D Genomics
Hello all,
I am trying out this Hi-C kit from Arima genomics. It contains multiple restriction enzymes cocktail for the sample prep. Could you let me know how I need to modify the Juicer script to account for this?

The two cut site motifs are: ^GATC and G^ANTC. The ‘^’ represents where the RE cuts on the + strand. The ’N’ in the second cut site motif can be either of the 4 genomic bases. 

I guess the for signatures of the ligation junctions within the chimeric reads, there are the following 4 possibilities:

GATC-GATC
GANT-GATC
GANT-ANTC
GATC-ANTC


Any suggestions will be helpful and appreciated!

Thanks
Viv

Neva Durand

unread,
Apr 19, 2018, 10:01:29 AM4/19/18
to Vivek, 3D Genomics

Hello Viv,

It is a little tricky to get correct counts for ligations in this case and depends on the grep installed on your system. The easiest thing to do is run Juicer as usual but calculate the ligation junctions separately. You would calculate the ligation junctions in the contacts using a command like this:

grep -cE '(GATCGATC|GAATGATC|GATTGATC|GACTGATC|GAGTGATC)' merged_nodups.txt

NB: I only listed the ligations junctions generated from the first two; you would need to generate all the possibilities (there are many).

For the fastqs, you would do something similar, but

paste <(zcat fname_R1.fastq.gz) <(zcat fname_R2.fastq.gz) | grep -cE '(GATCGATC|GAATGATC|GATTGATC|GACTGATC|GAGTGATC)'

Finally, you will need to generate a restriction site file the looks for the union of all these cut sites. You should modify this script to do so:
https://github.com/theaidenlab/juicer/blob/master/misc/generate_site_positions.py

In particular you need to test against multiple test strings.

If you aren’t worried about reads mapping to the same fragment, you could run Juicer with the flags “-x -s none”.

Best
Neva

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/2e6859b9-68a9-43f0-bc22-cc38cf15f5bb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
Neva Cherniavsky Durand, Ph.D.
Staff Scientist, Aiden Lab

Vivek

unread,
Apr 19, 2018, 8:49:07 PM4/19/18
to 3D Genomics
Hi Neva,

Thank you for the quick reply! I was able to process my samples after creating the restriction site file containing all the restriction sites. I calculated the ligation junctions for only one of the combinations. Overall it looks alright but I am worried about the high Below MAPQ threshold percentage. I don't know if it is normal. These are sequenced with Nextseq 75bp- paired end runs. Any comments are welcome!

Sequenced Read Pairs:  335,962,570

 Normal Paired: 294,375,197 (87.62%)

 Chimeric Paired: 20,978,603 (6.24%)

 Chimeric Ambiguous: 15,480,378 (4.61%)

 Unmapped: 5,128,392 (1.53%)

 Ligation Motif Present: 52,126,754 (15.52%)

Alignable (Normal+Chimeric Paired): 315,353,800 (93.87%)

Intra-fragment Reads: 10,708,012 (3.19% / 3.54%)

Below MAPQ Threshold: 105,092,480 (31.28% / 34.77%)

Hi-C Contacts: 186,467,439 (55.50% / 61.69%)

 Ligation Motif Present: 22,382,290  (6.66% / 7.40%)

 3' Bias (Long Range): 55% - 45%

 Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%

Inter-chromosomal: 15,892,647  (4.73% / 5.26%)

Intra-chromosomal: 170,574,792  (50.77% / 56.43%)

Short Range (<20Kb): 55,468,879  (16.51% / 18.35%)

Long Range (>20Kb): 115,104,816  (34.26% / 38.08%)



Thanks
Viv
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.

Neva Durand

unread,
Apr 24, 2018, 3:14:03 PM4/24/18
to Vivek, 3D Genomics
That is high. This is for MAPQ > 0?  It’s very easy to double check if the statistic itself is correct. If so, it might indicate that somehow this cocktail is causing alignability problems. 

To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/cd9814b8-53b0-4266-bb8f-8d4030f67e34%40googlegroups.com.

Vivek

unread,
Apr 25, 2018, 4:58:22 PM4/25/18
to 3D Genomics
Hi Neva,

Thanks for the reply! This what I got with a simple awk command:

cat merged_nodups.txt | awk 'BEGIN{N0=0;N30=0;total=0};$9<30 {N30=N30+1}; $9<=0 {N0=N0+1}; {total = total+1}; END{percent30=N30/total;percent0 = N0/total; print "no of below 30 MAPQ "percent30; print "no of zero or below MAPQ "percent0}'

no of below 30 MAPQ 0.249307

no of zero or below MAPQ 0.225292


These are my statistics for inter.txt

Experiment description: Juicer version 1.5.6; BWA 0.7.12-r1039; 32 threads; splitsize 90000000; java version "1.7.0_79"; Juicer Tools Version 1.6.2; /opt/juicer/scripts/juicer.sh -y /hi_c/references/arima_sites.txt -z /hi_c/references/dmel-all-chromosome-r5.57.fasta -p /hi_c/references/flybase_dm3.txt -g dm3 -b GATCGATC                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
Sequenced Read Pairs:  335,962,570
 Normal Paired: 294,375,197 (87.62%)
 Chimeric Paired: 20,978,603 (6.24%)
 Chimeric Ambiguous: 15,480,378 (4.61%)
 Unmapped: 5,128,392 (1.53%)
 Ligation Motif Present: 52,126,754 (15.52%)
Alignable (Normal+Chimeric Paired): 315,353,800 (93.87%)
Intra-fragment Reads: 10,708,012 (3.19% / 3.54%)
Below MAPQ Threshold: 105,092,480 (31.28% / 34.77%)
Hi-C Contacts: 186,467,439 (55.50% / 61.69%)
 Ligation Motif Present: 22,382,290  (6.66% / 7.40%)
 3' Bias (Long Range): 55% - 45%
 Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 15,892,647  (4.73% / 5.26%)
Intra-chromosomal: 170,574,792  (50.77% / 56.43%)
Short Range (<20Kb): 55,468,879  (16.51% / 18.35%)
Long Range (>20Kb): 115,104,816  (34.26% / 38.08%)

And for inter_30.txt: 

Experiment description: Juicer version 1.5.6; BWA 0.7.12-r1039; 32 threads; splitsize 90000000; java version "1.7.0_79"; Juicer Tools Version 1.6.2; /opt/juicer/scripts/juicer.sh -y /hi_c/references/arima_sites.txt -z /hi_c/references/dmel-all-chromosome-r5.57.fasta -p /hi_c/references/flybase_dm3.txt -g dm3 -b GATCGATC                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
Sequenced Read Pairs:  335,962,570
 Normal Paired: 294,375,197 (87.62%)
 Chimeric Paired: 20,978,603 (6.24%)
 Chimeric Ambiguous: 15,480,378 (4.61%)
 Unmapped: 5,128,392 (1.53%)
 Ligation Motif Present: 52,126,754 (15.52%)
Alignable (Normal+Chimeric Paired): 315,353,800 (93.87%)
Intra-fragment Reads: 10,708,012 (3.19% / 3.54%)
Below MAPQ Threshold: 110,975,711 (33.03% / 36.71%)
Hi-C Contacts: 180,584,208 (53.75% / 59.74%)
 Ligation Motif Present: 21,695,573  (6.46% / 7.18%)
 3' Bias (Long Range): 55% - 45%
 Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 13,697,579  (4.08% / 4.53%)
Intra-chromosomal: 166,886,629  (49.67% / 55.21%)
Short Range (<20Kb): 54,451,978  (16.21% / 18.01%)
Long Range (>20Kb): 112,433,576  (33.47% / 37.20%)

I don't know why there are some minor differences in the MAPQ values given by awk and the pipeline.

Thanks
Viv

Neva Durand

unread,
Apr 25, 2018, 5:14:33 PM4/25/18
to Vivek, 3D Genomics
The stats calculated in Juicer look at both reads, so fields 9 and 12.  If either is below the MAPQ threshold, they are counted as below.  Also, intra fragment reads are considered first - so for all the unique reads, first we look to see if it's an intrafragment read, and if not, we look to see if the minimum of fields 9 and 12 is below the threshold.

To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/a09ee3f6-c56e-4b54-a0ea-1bfdf53812d8%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Virginia M.

unread,
Oct 10, 2018, 12:41:51 AM10/10/18
to 3D Genomics
Hello,

I know this is an old thread but I am in the same situation--my data were generated with multiple restriction enzymes. Forgive me for being a little slow, but I'm confused about Neva's comment from April 19. It looks like the restriction site file generated by generate_site_positions.py has the chromosome name in the first column and the locations of all restriction sites in subsequent columns. To me "the union of all these cut sites" means that sites that are present in either the genome or the fastqs or merged_nodups.txt should all be included in the restriction site file. So what should the row look like for sites that are present only in the fastq or in merged_nodups.txt? Or did you mean the intersection of all these cut sites, i.e. sites that are present in all three files? I tried the grep commands Neva suggested, but the output was just a single long number. Maybe this is related to my version of grep. I am using: 
GNU bash, version 4.3.42(1)-release (x86_64-suse-linux-gnu)
OS is openSUSE Leap 42.2

I am happy to write my own version of generate_site_positions.py, I just want to make sure I am correctly understanding the desired output. Thanks in advance.

-Virginia

Neva Durand

unread,
Oct 10, 2018, 10:04:24 AM10/10/18
to vstar...@lbl.gov, 3D Genomics
Hello Virginia,

The grep command is for counting the ligation junctions.

To produce a restriction site file for multiple enzymes, you would indeed produce the union of all the cut sites. You would need to modify the Python script to look for your multiple sites in the reference genome. For each chromosome, you would produce a sorted list of cut sites; each cut site would correspond to one of the enzymes.

Alternatively, you could produce files for single enzymes and then combine them - this is the easiest thing to do and probably what I'd do (especially given the likelihood that you already have the single enzyme files created). I.e. for MboI and HindIII on hg19, you would take files hg19_MboI.txt and hg19_HindIII.txt, and combine their list for that chromosome. So if the first one is 

1 16007 24571 ...

and the second one is

1 11160 12411 12461 12686 12829 13315 13420 13566 13698 13915 14377 15000 15172 15750 17954 18292 18376 18393 18512 18727 20029 20393 20477 20719 23312 23615 24082 25327 ...

You would combine so the resulting line starts with

1 11160 12411 12461 12686 12829 13315 13420 13566 13698 13915 14377 15000 15172 15750 16007 17954 18292 18376 18393 18512 18727 20029 20393 20477 20719 23312 23615 24082 24571 25327 ...

Generating the restriction site file does NOT rely on raw data or on processed data - it is solely based on the reference genome and the cut site motif. The restriction site file is then used in the pipeline to produce a fragment number for the aligned data. So you would produce the restriction site file once, and then be able to use it for any experiment utilizing the same organism and enzyme cocktail.

Does that make sense?
Neva



For more options, visit https://groups.google.com/d/optout.

Virginia M.

unread,
Oct 10, 2018, 1:31:48 PM10/10/18
to 3D Genomics
Ohhh okay I was completely misunderstanding the purpose of the grep command. I appreciate the tip about merging the individual restriction site files. This all makes much more sense now. Thank you!!

Zhibo M.

unread,
Oct 19, 2018, 12:48:51 AM10/19/18
to 3D Genomics
Hi Neva,
Thanks for the explanation
I recently also encountered some data sets that used restriction enzymes recognizing multiple sites (such as G^ANTC) or using multiple enzymes. As some commercial kits are using this multiple sites strategy, I guess more and more people might face similar issue sooner or later. HiC-Pro supports multiple ligation sites in the config file but it normalization method is limited to ICE. I still prefer to use juicer. Would you consider adding a new feature to the juicer to take multiple ligation sites either withe the -b flag or edit the $ligation variable in the juicer.sh?  I think this would be a popular feature that the community will like.
As I understood, the ligation site(s) is only used in the statistics. The actual mapping and filtering uses the restriction sites file. Am I correct?
Thank you!
Zhibo

To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.

Neva Durand

unread,
Oct 23, 2018, 9:35:12 AM10/23/18
to Zhibo M., 3D Genomics
Yes, we can do that - could you put it in as a feature request on the Juicer GitHub?

Indeed the ligation junction is only used for statistics.


For more options, visit https://groups.google.com/d/optout.

Robert King

unread,
Feb 5, 2020, 7:14:13 AM2/5/20
to 3D Genomics
Are arima genomics enzymes an option in the restriction file creation, I don't see it. I can create a bed file using HiC-Pro script if there is a way to convert this to a format that juicer can use? I have used HiC-pro to generate maps and a hic file to use in juicerbox but I would like to go the juicer and 3d-dna route with my data to automate the scaffolding.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-ge...@googlegroups.com.

Robert King

unread,
Feb 5, 2020, 7:20:23 AM2/5/20
to 3D Genomics
silly I do see a Arima option in the latest script. 

Haiyan Lei

unread,
Feb 14, 2020, 5:59:34 PM2/14/20
to 3D Genomics
Hi, all:

I just tried to analyze the Arima-HiC data, the following is the advice from Arima technical support. I downloaded the hg19_GATC_GANTC.txt restriction file from their website. I used juicer.sh -y hg19_GATC_GANTC.txt for the analysis. I am a little confused about the multiple ligation junction. How to specify the multiple ligation during the run? use -b parameter? Can anyone please give me some suggestions? Thanks very much.

best

HY

Arima technical support advice: 

When you begin analyzing your Arima-HiC data, it is likely that you will need to tailor your analysis pipeline to account for our unique restriction enzyme cocktail. This will allow for the generation of genome cut-site files as well as the identification of ligation junctions in your dataset. The following information should allow you to input the restriction sites used with Arima HiC into Juicer.

The restriction enzymes in the Arima-HiC cocktail cut at the following motifs, where ‘^’ is the cut site on the + strand (‘N’ can be either of the 4 genomic bases):

^GATC

G^ANTC

You will need to provide a custom Arima-HiC-specific Juicer.sh command line script. Arima can provide you this file for mm9, mm10, hg19, and hg38. Please copy and paste this Ftp link into your browser to access the files (ftp://ftp-arimagenomics.sdsc.edu/pub/JUICER_CUTSITE_FILES).

The default behavior of Juicer will be to calculate the number of chimeric reads assuming the only possible ligation junction (GATCGATC), which is one of the expected ligation junctions produced by our kit. However, our Arima-HiC chemistry uses multiple restriction enzymes and can produce 25 possible ligation junction motifs:

LIGATION_SITE = GAATAATC,GAATACTC,GAATAGTC,GAATATTC,GAATGATC,GACTAATC,GACTACTC,GACTAGTC,GACTATTC,GACTGATC,GAGTAATC,GAGTACTC,GAGTAGTC,GAGTATTC,GAGTGATC,GATCAATC,GATCACTC,GATCAGTC,GATCATTC,GATCGATC,GATTAATC,GATTACTC,GATTAGTC,GATTATTC,GATTGATC

Unfortunately Juicer cannot accommodate more than one possible ligation junction motif so if you want the true number (which will be higher than what will be calculated), you would need to obtain that manually using grep. We can provide instruction for how to do this if need be. 

In addition to -y, the other argument variables we define are:

-d 

-p

-z

-D

-t

Neva Durand

unread,
Feb 28, 2020, 2:40:07 PM2/28/20
to Haiyan Lei, 3D Genomics
You can use the -b flag and call it like this:

ligation="'(GAATAATC|GAATACTC)'"

-b $ligation

putting the or symbol "|" instead of the comma everywhere.

Note that to properly see if you're getting higher ligations than expected, you should try and characterize the average occurrence of those motifs in the genome.



--
Neva Cherniavsky Durand, Ph.D.
Pronouns: she, her, hers
Assistant Professor, Aiden Lab

张彦春

unread,
Apr 8, 2020, 7:51:05 PM4/8/20
to 3D Genomics
Hello Neva, I have a similar question when processing the multiple enzyme data. The two cut site motifs are: ^GATC and G^ANTC, how can I set the appropriate site parameter for juicer? The usage webpage said the default is MboI, is it ok that I use this default value and use a restriction site file consistent to the multiple enzyme? I am not clear about the relationship between the site parameter and restriction site file and their impact on the analysis results.

Thank you.

Neva Durand

unread,
Apr 13, 2020, 2:43:29 PM4/13/20
to 张彦春, 3D Genomics
Hi,

The site is used for three things:

1 - Determining the fragment number, in order to eliminate reads that align to the same fragment
2 - Determining the ligation frequency
3 - Calculating some QC statistics and graphs

These are all most relevant when there is a single cut site. Ligation count is still meaningful with multiple cut sites, but you need to have a background expected model (e.g. how often the sequences appear in the genome by chance). Eliminating intra-fragment reads is something we do along with other groups in this space (some groups eliminate even more reads, i.e. both intra fragment reads and reads aligning to the their immediate neighbor) but more research needs to be done as to whether or not this is absolutely necessary.  The QC graphs are certainly useful as a sanity check, but not strictly necessary to determine if you have a good library.

If you want to count ligation junctions, you can use the "-b" flag as above and send in the site as "none". You won't remove intra fragment reads or calculate the QC graphs and the "distance from RE site" statistics will be meaningless. Or you could send in the site map linked in this thread with the "-f" flag and the ligation with the "-b" flag.

Alternatively you could just run with "-s none". 

If you run with default (i.e. MboI), you risk losing some reads in your hic file that are incorrectly marked as intrafragment.

Hope that helps.
Neva

On Wed, Apr 8, 2020 at 7:51 PM 张彦春 <yczh...@gmail.com> wrote:
Hello Neva, I have a similar question when processing the multiple enzyme data. The two cut site motifs are: ^GATC and G^ANTC, how can I set the appropriate site parameter for juicer? The usage webpage said the default is MboI, is it ok that I use this default value and use a restriction site file consistent to the multiple enzyme? I am not clear about the relationship between the site parameter and restriction site file and their impact on the analysis results.

Thank you.

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages