Hello,
Could you describe the “HiC BAM file”? Are these just HiC contacts stored in a BAM? The “pre” command can’t take in a raw bam because the data needs to be processed first, and in particular chimeras needs to be handled appropriately.
You will have to assign the fragment numbers. The script fragment.pl in Juicer does this:
https://github.com/theaidenlab/juicer/blob/master/UGER/scripts/fragment.pl
You will need a restriction site file. You can create one by using or modifying generate_site_positions.py:
https://github.com/theaidenlab/juicer/blob/master/misc/generate_site_positions.py
If you don’t need the fragment information, you can just use the strand, chromosome, and position information and assign a dummy fragment number. Be sure to assign a different one to each read end so that “pre” doesn’t assume they are on the same fragment. (E.g. assign 0 to the first read end and 1 to the second.)
Here is a code fragment from Juicer that does this:
awk '{printf("%s %s %s %d %s %s %s %d", \$1, \$2, \$3, 0, \$4, \$5, \$6, 1); for (i=7; i<=NF; i++) {printf(" %s",\$i);}printf("\n");}' $name${ext}_norm.txt > $name${ext}.frag.txt
Also note that Juicer now includes support for DNAse mode; just set your restriction enzyme to “none”.
Best
Neva
--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/4ff57497-3fde-4de6-9573-c2b2d619ba7e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi Rola,
Re: strand information: on the one hand, “pre” does not currently use this information, but on the other hand, it’s important for duplicate filtering. Anyway it’s pretty easy to extract from the SAM, just put something like this in your awk code:
# strand; Bit 16 set means reverse strand
strand = and($2,16);
The more salient issue is whether or not the BAM is just paired reads from HiC. Basically the task is easy if the BAM already has only paired contacts (i.e. no chimeric reads/they’ve already been handled), but the reads from alignment do have chimeras and those need to be handled properly. We’ve explored using BAM in the past and might consider it again; the source code of Juicebox does know how to read BAM files already so it’s just a question of tying things together.
From briefly looking at hicup, I believe that chimeric reads are thrown away - just be aware that you should expect chimeric reads in a Hi-C experiment and indeed would want to keep those reads if they can be properly paired.
Best
Neva
--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/3234a0e9-2c86-4811-8ba5-dfc4a4010cfd%40googlegroups.com.
samtools view GM12878.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=$3; pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=$3; pos2=$4; mapq2=$5; print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}' > GM12878.Arrowhead.input
SRR1658598G.122 16 chr1 152123221 0 0 chr1 165457762 1 42 42
SRR1658598G.260 0 chr1 147380159 0 16 chr1 147274676 1 42 42
SRR1658598G.265 0 chr1 186432636 0 16 chr1 192496596 1 42 42
SRR1658598G.339 16 chr1 165278667 0 16 chr1 170676175 1 42 42
SRR1658575G.271 16 chr2 108019611 0 16 chr2 107565187 1 42 42
SRR1658575G.273 0 chr2 100647591 0 16 chr2 100613456 1 42 42
SRR1658575G.332 16 chr2 201220553 0 16 chr2 146459489 1 42 42
SRR1658575G.342 0 chr2 120500274 0 16 chr2 120501689 1 42 42
SRR1658575G.373 0 chr2 190732478 0 16 chr2 191695113 1 42 42
SRR1658575G.374 0 chr2 111900804 0 16 chr2 112277747 1 42 42
java -jar juicebox_tools_7.5.jar pre -q 10 GM12878.Arrowhead.input GM12878.Arrowhead.hic hg19
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1387)
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1158)
at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:572)
at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:303)
at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:213)
at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:94)
at juicebox.tools.HiCTools.main(HiCTools.java:77)
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/b57895ec-c7f4-4d4d-b569-9fea5d815c7d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
samtools view GM12878.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=substr($3, 4); pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=substr($3, 4); pos2=$4; mapq2=$5; if(name1==name2) {print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}}' > GM12878.Arrowhead.input
......Error: the chromosome combination 2_2 appears in multiple blocks
Ah, yes - the file must be sorted by chromosome (at least, the chromosome blocks need to be together).
samtools view GM12878.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=substr($3, 4); pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=substr($3, 4); pos2=$4; mapq2=$5; if(name1==name2) { if (chr1>chr2){print name1, str2, chr2, pos2,1, str1, chr1, pos1, 0, mapq2, mapq1}else{ print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}}' | sort -k3,3d -k7,7d > GM12878.Arrowhead.input
--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/78b2a278-da04-4e5f-bc87-4b4c77932416%40googlegroups.com.
samtools view input.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=substr($3, 4); pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=substr($3, 4); pos2=$4; mapq2=$5; if(name1==name2) { if (chr1>chr2){print name1, str2, chr2, pos2,1, str1, chr1, pos1, 0, mapq2, mapq1} else {print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}}}' | sort -k3,3d -k7,7d > Arrowhead.input
SRR1658598G.122 83 chr1 152123221 42 101M = 165457762 0 TGCAATTTCTGACTTCTGGGACCTTACTGCTTGGTGGGAAACACATGGGTTTAAAATTTACTTCAATGCAATCTGATAAGCTGTATGCTGATGATAGGTGC >;DDDDDDEEECECDEFDDFHGHHHEGIHDJIGHGHGBGHFFDJIHGJJJJIJJJJJIIHJIJJJJJIJJJIIGIIJJJGJJJJJJIJHHHFHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU CT:Z:FAR
SRR1658598G.122 163 chr1 165457762 42 72M = 152123221 0 GTCTAATGTGGTCAGAGTTGGGGACCATTAGTCTGGATAAGGCAATTTTCTAAATATCTTGAAATTTGGATC C@CFFFFFBFHFHGHFHEIIHGIIIJFHIGHFHIJJ>HCHCHGIJJIJJJGDHIIJIJJJIEGHIIIGGGIG AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU CT:Z:FAR
SRR1658598G.260 99 chr1 147380159 42 26M = 147274676 0 TCACCGTGCTTTTCATCTTCCGGATC BB@DFFFFGGHHHJJJGJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:26 YT:Z:UU CT:Z:FAR
SRR1658598G.260 147 chr1 147274676 42 101M = 147380159 0 CTTCTGCCCACGTGTTCTGCGTCATCCTACCTGTAATGTGGCAGTGGGGACAAGGATTGTAGGGTTGTTCCTTTTGGCAAAAAGAGTAACCGGTAGGTTTA CDDBC@DBB<>DDDC88@BBCCDCCCDEECB??FDBHAA=?HGBIHF;IGHIGGJIIEJHFD;BGJJGIGIIJIIGIJJIIGHGGGHHFHGHHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU CT:Z:FAR
--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/5c6901dc-c1cd-4aef-8c50-e45a83f4ee24%40googlegroups.com.
Great! Thanks for the correction and for posting the code.BestNeva
On Fri, Jul 29, 2016 at 2:41 PM, rdali <rola...@gmail.com> wrote:
Thank you Neva! That works well.You are missing a bracket in case someone else needs the code.
samtools view input.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {name1=$1; str1=and($2,16); chr1=substr($3, 4); pos1=$4; mapq1=$5; getline; name2=$1; str2=and($2,16); chr2=substr($3, 4); pos2=$4; mapq2=$5; if(name1==name2) { if (chr1>chr2){print name1, str2, chr2, pos2,1, str1, chr1, pos1, 0, mapq2, mapq1} else {print name1, str1, chr1, pos1, 0, str2, chr2, pos2 ,1, mapq1, mapq2}}}' | sort -k3,3d -k7,7d > Arrowhead.inputThe bam is a bam of paired reads sorted by Read name. The chromosome scaffolds are in the form of "chrA" (which is why chr1=substr($3, 4) removes the first 3 characters.example:SRR1658598G.122 83 chr1 152123221 42 101M = 165457762 0 TGCAATTTCTGACTTCTGGGACCTTACTGCTTGGTGGGAAACACATGGGTTTAAAATTTACTTCAATGCAATCTGATAAGCTGTATGCTGATGATAGGTGC >;DDDDDDEEECECDEFDDFHGHHHEGIHDJIGHGHGBGHFFDJIHGJJJJIJJJJJIIHJIJJJJJIJJJIIGIIJJJGJJJJJJIJHHHFHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU CT:Z:FAR
SRR1658598G.122 163 chr1 165457762 42 72M = 152123221 0 GTCTAATGTGGTCAGAGTTGGGGACCATTAGTCTGGATAAGGCAATTTTCTAAATATCTTGAAATTTGGATC C@CFFFFFBFHFHGHFHEIIHGIIIJFHIGHFHIJJ>HCHCHGIJJIJJJGDHIIJIJJJIEGHIIIGGGIG AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU CT:Z:FAR
SRR1658598G.260 99 chr1 147380159 42 26M = 147274676 0 TCACCGTGCTTTTCATCTTCCGGATC BB@DFFFFGGHHHJJJGJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:26 YT:Z:UU CT:Z:FAR
SRR1658598G.260 147 chr1 147274676 42 101M = 147380159 0 CTTCTGCCCACGTGTTCTGCGTCATCCTACCTGTAATGTGGCAGTGGGGACAAGGATTGTAGGGTTGTTCCTTTTGGCAAAAAGAGTAACCGGTAGGTTTA CDDBC@DBB<>DDDC88@BBCCDCCCDEECB??FDBHAA=?HGBIHF;IGHIGGJIIEJHFD;BGJJGIGIIJIIGIJJIIGHGGGHHFHGHHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU CT:Z:FAR
--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-ge...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/5c6901dc-c1cd-4aef-8c50-e45a83f4ee24%40googlegroups.com.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/045c8503-dafc-42e2-907a-06e54fe762e3%40googlegroups.com.