input needed to create .hic file using pre command

rdali

unread,

Jul 28, 2016, 11:57:19 AM7/28/16

to 3D Genomics

Hello,

I have just started to look at the juicebox command line tools. I am trying to create the 11 column paired read input file that is used by the pre command to output the .hic file needed by juicebox.

I am wondering whether there is a script or a program that outputs that 11 column input file from a HiC bam file or whether I need to write my own script to extract that data?

In case I need to do it myself, where can I extract fragment information from?

My data was aligned and filtered by hicup and further HiC processing was run by Homer.

Thank you!

Yunjiang Qiu

unread,

Jul 28, 2016, 4:02:59 PM7/28/16

to 3D Genomics

I think you may use juicer pipeline directly, otherwise you may need to write your script to transfer from bam file.

Neva Durand

unread,

Jul 28, 2016, 4:22:43 PM7/28/16

to rdali, 3D Genomics

Hello,

Could you describe the “HiC BAM file”? Are these just HiC contacts stored in a BAM? The “pre” command can’t take in a raw bam because the data needs to be processed first, and in particular chimeras needs to be handled appropriately.

You will have to assign the fragment numbers. The script fragment.pl in Juicer does this:
https://github.com/theaidenlab/juicer/blob/master/UGER/scripts/fragment.pl

You will need a restriction site file. You can create one by using or modifying generate_site_positions.py:
https://github.com/theaidenlab/juicer/blob/master/misc/generate_site_positions.py

If you don’t need the fragment information, you can just use the strand, chromosome, and position information and assign a dummy fragment number. Be sure to assign a different one to each read end so that “pre” doesn’t assume they are on the same fragment. (E.g. assign 0 to the first read end and 1 to the second.)

Here is a code fragment from Juicer that does this:

awk '{printf("%s %s %s %d %s %s %s %d", \$1, \$2, \$3, 0, \$4, \$5, \$6, 1); for (i=7; i<=NF; i++) {printf(" %s",\$i);}printf("\n");}' $name${ext}_norm.txt > $name${ext}.frag.txt

Also note that Juicer now includes support for DNAse mode; just set your restriction enzyme to “none”.

Best
Neva

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/4ff57497-3fde-4de6-9573-c2b2d619ba7e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Neva Cherniavsky Durand, Ph.D.

Staff Scientist, Aiden Lab

www.aidenlab.org

Message has been deleted

rdali

unread,

Jul 29, 2016, 11:52:01 AM7/29/16

to 3D Genomics, rola...@gmail.com

Hello Neva,

Thank you for the quick reply.

Yes, I have a regular bam file with aligned HiC reads that has already been filtered for read quality and HiC library artifacts, among other things.

I simply want to use that data to call domains using arrowhead algorithm.

I will extract the information needed from the bam using an awk command similar to the one you posted. For the strand information, how important is it? I assume I would need to decompose the flag field in the bam file to fill that column in. If it is not critical to the Arrowhead algorithm, I can save time and fill it with dummy numbers 1 and 0 for read1 and read2 respectively? My bam is 280Gb so unnecessary parsing is always a plus.

Thank you!

Rola

Neva Durand

unread,

Jul 29, 2016, 1:23:36 PM7/29/16

to rdali, 3D Genomics

Hi Rola,

Re: strand information: on the one hand, “pre” does not currently use this information, but on the other hand, it’s important for duplicate filtering. Anyway it’s pretty easy to extract from the SAM, just put something like this in your awk code:

# strand; Bit 16 set means reverse strand
strand = and($2,16);

The more salient issue is whether or not the BAM is just paired reads from HiC. Basically the task is easy if the BAM already has only paired contacts (i.e. no chimeric reads/they’ve already been handled), but the reads from alignment do have chimeras and those need to be handled properly. We’ve explored using BAM in the past and might consider it again; the source code of Juicebox does know how to read BAM files already so it’s just a question of tying things together.

From briefly looking at hicup, I believe that chimeric reads are thrown away - just be aware that you should expect chimeric reads in a Hi-C experiment and indeed would want to keep those reads if they can be properly paired.

Best

Neva

--

You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/3234a0e9-2c86-4811-8ba5-dfc4a4010cfd%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

rdali

unread,

Jul 29, 2016, 3:00:25 PM7/29/16

to 3D Genomics, rola...@gmail.com

Hello Neva,

I believe Hicup throws away the chimeric reads. I assume the reads in the bam are fine to move forward with.

I used awk to extract the required fields from the bam file:

samtools view GM12878.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=$3; pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=$3; pos2=$4; mapq2=$5; print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}' > GM12878.Arrowhead.input

The bam is sorted by read name. I have checked that the 2 reads follow each other.

This results in the 11 column file required:

SRR1658598G.122 16 chr1 152123221 0 0 chr1 165457762 1 42 42

SRR1658598G.260 0 chr1 147380159 0 16 chr1 147274676 1 42 42

SRR1658598G.265 0 chr1 186432636 0 16 chr1 192496596 1 42 42

SRR1658598G.339 16 chr1 165278667 0 16 chr1 170676175 1 42 42

SRR1658575G.271 16 chr2 108019611 0 16 chr2 107565187 1 42 42

SRR1658575G.273 0 chr2 100647591 0 16 chr2 100613456 1 42 42

SRR1658575G.332 16 chr2 201220553 0 16 chr2 146459489 1 42 42

SRR1658575G.342 0 chr2 120500274 0 16 chr2 120501689 1 42 42

SRR1658575G.373 0 chr2 190732478 0 16 chr2 191695113 1 42 42

SRR1658575G.374 0 chr2 111900804 0 16 chr2 112277747 1 42 42

When I use the pre command:

java -jar juicebox_tools_7.5.jar pre -q 10 GM12878.Arrowhead.input GM12878.Arrowhead.hic hg19

I get the following error:

java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.

at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1387)

at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1158)

at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:572)

at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:303)

at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:213)

at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:94)

at juicebox.tools.HiCTools.main(HiCTools.java:77)

The input file contains 460268 lines only just for testing. Is that the problem?

Neva Durand

unread,

Jul 29, 2016, 3:47:53 PM7/29/16

to rdali, 3D Genomics

No, the problem is that the hg19 we use has the chromosome names defined as "1" "2" etc not "chr1". You can either remove the "chr" (via sed for example) or provide your own chrom.sizes file with the -p flag. The chrom.sizes just lists, for every chromosome you want to see, the chromosome name and size of chromosome in bp (tab delimited).

Best

Neva

To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/b57895ec-c7f4-4d4d-b569-9fea5d815c7d%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

rdali

unread,

Jul 29, 2016, 4:05:56 PM7/29/16

to 3D Genomics, rola...@gmail.com

Thank! that's easy to fix.

samtools view GM12878.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=substr($3, 4); pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=substr($3, 4); pos2=$4; mapq2=$5; if(name1==name2) {print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}}' > GM12878.Arrowhead.input

I ran the pre command again and got:

......Error: the chromosome combination 2_2 appears in multiple blocks

Neva Durand

unread,

Jul 29, 2016, 4:42:27 PM7/29/16

to rdali, 3D Genomics

Ah, yes - the file must be sorted by chromosome (at least, the chromosome blocks need to be together).

samtools view GM12878.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=substr($3, 4); pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=substr($3, 4); pos2=$4; mapq2=$5; if(name1==name2) { if (chr1>chr2){print name1, str2, chr2, pos2,1, str1, chr1, pos1, 0, mapq2, mapq1}else{ print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}}'  | sort -k3,3d -k7,7d > GM12878.Arrowhead.input

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/78b2a278-da04-4e5f-bc87-4b4c77932416%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

rdali

unread,

Jul 29, 2016, 5:41:05 PM7/29/16

to 3D Genomics, rola...@gmail.com

Thank you Neva! That works well.

You are missing a bracket in case someone else needs the code.

samtools view input.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=substr($3, 4); pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=substr($3, 4); pos2=$4; mapq2=$5; if(name1==name2) { if (chr1>chr2){print name1, str2, chr2, pos2,1, str1, chr1, pos1, 0, mapq2, mapq1} else {print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}}}'  | sort -k3,3d -k7,7d > Arrowhead.input

The bam is a bam of paired reads sorted by Read name. The chromosome scaffolds are in the form of "chrA" (which is why chr1=substr($3, 4) removes the first 3 characters.

example:

SRR1658598G.122 83 chr1 152123221 42 101M = 165457762 0 TGCAATTTCTGACTTCTGGGACCTTACTGCTTGGTGGGAAACACATGGGTTTAAAATTTACTTCAATGCAATCTGATAAGCTGTATGCTGATGATAGGTGC >;DDDDDDEEECECDEFDDFHGHHHEGIHDJIGHGHGBGHFFDJIHGJJJJIJJJJJIIHJIJJJJJIJJJIIGIIJJJGJJJJJJIJHHHFHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU CT:Z:FAR

SRR1658598G.122 163 chr1 165457762 42 72M = 152123221 0 GTCTAATGTGGTCAGAGTTGGGGACCATTAGTCTGGATAAGGCAATTTTCTAAATATCTTGAAATTTGGATC C@CFFFFFBFHFHGHFHEIIHGIIIJFHIGHFHIJJ>HCHCHGIJJIJJJGDHIIJIJJJIEGHIIIGGGIG AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU CT:Z:FAR

SRR1658598G.260 99 chr1 147380159 42 26M = 147274676 0 TCACCGTGCTTTTCATCTTCCGGATC BB@DFFFFGGHHHJJJGJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:26 YT:Z:UU CT:Z:FAR

SRR1658598G.260 147 chr1 147274676 42 101M = 147380159 0 CTTCTGCCCACGTGTTCTGCGTCATCCTACCTGTAATGTGGCAGTGGGGACAAGGATTGTAGGGTTGTTCCTTTTGGCAAAAAGAGTAACCGGTAGGTTTA CDDBC@DBB<>DDDC88@BBCCDCCCDEECB??FDBHAA=?HGBIHF;IGHIGGJIIEJHFD;BGJJGIGIIJIIGIJJIIGHGGGHHFHGHHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU CT:Z:FAR

Neva Durand

unread,

Jul 29, 2016, 5:42:23 PM7/29/16

to rdali, 3D Genomics

Great! Thanks for the correction and for posting the code.

Best

Neva

--

You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/5c6901dc-c1cd-4aef-8c50-e45a83f4ee24%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Harish Kothandaraman

unread,

Jul 24, 2019, 1:12:21 PM7/24/19

to 3D Genomics

Hi!

Apologies that I'm reviving such an old thread, but I'm a bit confused if not curious and I was hoping that I might get it clear.

I'm using an Arima based kit, that has two restriction enzymes. I have been able to generate the map file which is required for Juicer. Meanwhile, I'm planning to rectify the assembly as well. Scaffolds generated using SALSA, but lots of post-processing on it has been done that it's not computationally cheap at this point to re-run the processing.

I have thus mapped the Hi-C reads to the genome and have a filtered BAM as per Arima's recommendations.

The question that I have is this:

Since the above awk command does generate a 11 column text file and you did suggest earlier to use fragment.pl to input the positions identified for the kit, should I drop the field that adds the fragment site positions in the awk? Because I don't the sorted file gets 13 columns which doesn't get parsed by Juicer.

However, at this point, I do have another option available. Juicer pre does take the fragment site map using -f option as well.

Should I just give the sorted file generated using the BAM file and pass it alongwith the map in Juicer? Does it make sense?

###########################

What I'm currently doing looks like this:

samtools view Filtered.bam | awk command | sort command > hic_input.11 (11 being the column counts)

java -jar juicer.jar pre -f site_map.txt hic_input.11 scaffolds.hic scaffolds_size.tsv

###########################

Where as what has been suggested is this:

samtools view Filtered.bam | awk command | sort command > hic_input.11 ((11 being the column counts)

perl fragment.pl hic_input.11 juicer.input site_map.txt (however juicer.input now has 13 columns, which makes no sense!)

##############################

Apologies for the ridiculous question and examples, but I'm a bit confused right now!

Thank you!

Regards,

Harish

On Saturday, July 30, 2016 at 3:12:23 AM UTC+5:30, Neva Durand wrote:

Great! Thanks for the correction and for posting the code.

Best
Neva

On Fri, Jul 29, 2016 at 2:41 PM, rdali <rola...@gmail.com> wrote:

Thank you Neva! That works well.

You are missing a bracket in case someone else needs the code.

samtools view input.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=substr($3, 4); pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=substr($3, 4); pos2=$4; mapq2=$5; if(name1==name2) { if (chr1>chr2){print name1, str2, chr2, pos2,1, str1, chr1, pos1, 0, mapq2, mapq1} else {print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}}}' | sort -k3,3d -k7,7d > Arrowhead.input

The bam is a bam of paired reads sorted by Read name. The chromosome scaffolds are in the form of "chrA" (which is why chr1=substr($3, 4) removes the first 3 characters.

example:

SRR1658598G.122 83 chr1 152123221 42 101M = 165457762 0 TGCAATTTCTGACTTCTGGGACCTTACTGCTTGGTGGGAAACACATGGGTTTAAAATTTACTTCAATGCAATCTGATAAGCTGTATGCTGATGATAGGTGC >;DDDDDDEEECECDEFDDFHGHHHEGIHDJIGHGHGBGHFFDJIHGJJJJIJJJJJIIHJIJJJJJIJJJIIGIIJJJGJJJJJJIJHHHFHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU CT:Z:FAR

SRR1658598G.122 163 chr1 165457762 42 72M = 152123221 0 GTCTAATGTGGTCAGAGTTGGGGACCATTAGTCTGGATAAGGCAATTTTCTAAATATCTTGAAATTTGGATC C@CFFFFFBFHFHGHFHEIIHGIIIJFHIGHFHIJJ>HCHCHGIJJIJJJGDHIIJIJJJIEGHIIIGGGIG AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU CT:Z:FAR

SRR1658598G.260 99 chr1 147380159 42 26M = 147274676 0 TCACCGTGCTTTTCATCTTCCGGATC BB@DFFFFGGHHHJJJGJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:26 YT:Z:UU CT:Z:FAR

SRR1658598G.260 147 chr1 147274676 42 101M = 147380159 0 CTTCTGCCCACGTGTTCTGCGTCATCCTACCTGTAATGTGGCAGTGGGGACAAGGATTGTAGGGTTGTTCCTTTTGGCAAAAAGAGTAACCGGTAGGTTTA CDDBC@DBB<>DDDC88@BBCCDCCCDEECB??FDBHAA=?HGBIHF;IGHIGGJIIEJHFD;BGJJGIGIIJIIGIJJIIGHGGGHHFHGHHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU CT:Z:FAR

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.

To unsubscribe from this group and stop receiving emails from it, send an email to 3d-ge...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/5c6901dc-c1cd-4aef-8c50-e45a83f4ee24%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Harish Kothandaraman

unread,

Jul 24, 2019, 1:15:15 PM7/24/19

to 3D Genomics

Sorry for the double post!

Do you want the BAMs to be sorted on the basis of corrdinate or on readnames?

Neva Durand

unread,

Aug 9, 2019, 10:49:55 AM8/9/19

to Harish Kothandaraman, 3D Genomics

BAM should be sorted by read name. The awk code ignores the fragment number (assigns everything to a different fragment). If instead you want to properly assign the fragment, you would do a slightly different command - let me know if this is still something you're working on and I can send.

To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/045c8503-dafc-42e2-907a-06e54fe762e3%40googlegroups.com.

Reply all

Reply to author

Forward