Hello there,
I am working with micro-c data. I was able to create hic matrix with juicer pre command. However, I am running into issues arising from the different number of dimensions for same resolution across my samples. My understanding is that since I am using hg19 chrom size files to partition to get the resolution, all the dimensions of my samples should be same as long as the resolution is the same. This is not true in my data. For example, at 50kb, chr3, chr5, and chr16 of my 2 biological replicates have differing dimensions. In chr3, there is 1 extra dimension in my rep 1. In chr 5, there is 4 extra dimensions in my rep 1. Chr 16 has 1 extra dimension in rep 1. Some extra dimensions/slots have 0s, some have 0s and non-zero values.
Then I checked for the genomic coordinates of the bins in txt file extracted with straw. I am finding that number of observations for each of my samples differ widely (looks like 0 bins were excluded with straw) and their coordinates don’t match.
Could someone please take a look and advice on this matter? Any help you can provide is greatly appreciated.
The following is the exact codes were used and some helpful R output to demonstrate this issue.
1. fastq files from Micro-c experiment was processed with dovetail genomics suggested steps (https://micro-c.readthedocs.io/en/latest/fastq_to_bam.html)
a. Exact script used to make .pairs file
#align to final valid pairs
bwa mem -5SP -T0 -t16 A673_Bio1_DBD_R1_comb.fastq.gz A673_Bio1_DBD_R2_comb.fastq.gz | \
pairtools parse --min-mapq 40 --walks-policy 5unique \
--max-inter-align-gap 30 --nproc-in 16 --nproc-out 16 --chroms-path /gpfs0/export/reference/homo_sapiens/hg19/ucsc_assembly/illumina_download/Sequence/WholeGenomeFasta/genome.fa | \
pairtools sort --tmpdir=/gpfs0/home1/temp/ --nproc 16 |
pairtools dedup --nproc-in 16 --nproc-out 16 --mark-dups --output-stats A673_Bio1_DBD_microc_stat.txt |
pairtools split --nproc-in 16 --nproc-out 16 --output-pairs A673_Bio1_DBD_mapped.pairs --output-sam -|samtools view -bS -@16 | \
samtools sort -@16 -o A673_Bio1_DBD_mapped.PT.bam;samtools index A673_Bio1_DBD_mapped.PT.bam
#qc
python3 /export/apps/opt/Micro-C/get_qc.py -p A673_Bio1_DBD_microc_stat.txt
2. Ran Juicer pre to create hic matrix as suggested by (https://micro-c.readthedocs.io/en/latest/contact_map.html#generating-hic-contact-maps-using-juicer-tools)
a. Exact code used to create hic matrix
java -Xmx48000m -Djava.awt.headless=true -jar /juicer_tools_1.22.01.jar pre --threads 16 *_mapped.pairs A673_"$dir"_matrix.hic ../matrix/hg19.chrom.sizes.clean
b. When HicRep tool was used to calculate SCC (https://github.com/TaoYang-dev/hicrep), some chromosomes dimensions of the matrix don't match with the other replicate dimension.
See below:
#importing matrices from hic using hicrep hic2mat command
#Rep1 of DBD condition
DBD1_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
mat1 <-
hic2mat(
file = "/Bio1_DBD/A673_Bio1_DBD_matrix.hic",
chromosome1 = i,
chromosome2 = i,
resol = 50000,
method = "KR"
)
DBD1_matrices[[i]] <- mat1
}
#Rep 2 pf DBD condition
DBD2_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
mat2 <-
hic2mat(
file = "/Bio2_DBD/A673_Bio2_DBD_matrix.hic",
chromosome1 = i,
chromosome2 = i,
resol = 50000,
method = "KR"
)
DBD2_matrices[[i]] <- mat2
}
#dimensions of chr3 in both replicates
dim(DBD1_matrices[["chr3"]])
[1] 3960 3960
dim(DBD2_matrices[["chr3"]])
[1] 3959 3959
3. Used straw to 3 columned text file
a. straw NONE ../Bio1_DBDp/A673_Bio1_DBDp_matrix.hic "$chr" "$chr" BP 5000 > 5kb/DBDp1/A673_DBDp1_"$chr"_NONE_5kb.txt
b. After improting the txt file into R, DBD1 has 1,855,898 observartions versus DBD2 has 1,118,955
See below for the mismatching genomic coordinates of the bins.
DBD1_21$region1 (start of genomic location of region 1)
[1] 9410000 9410000 9415000 9415000 9420000 9415000 9420000 9425000 9410000 9415000 9420000 9425000 9430000 9415000
[15] 9420000 9425000 9430000 9435000 9420000 9425000 9430000 9435000 9440000 9435000 9440000 9445000 9415000 9430000
[29] 9435000 9440000 9445000 9450000 9420000 9425000 9430000 9435000 9440000 9450000 9455000 9440000 9450000 9455000
[43] 9460000 9425000 9435000 9455000 9460000 9465000 9415000 9425000 9430000 9435000 9440000 9450000 9455000 9460000
[57] 9465000 9470000 9415000 9420000 9425000 9430000 9450000 9455000 9465000 9470000 9475000 9430000 9470000 9475000
[71] 9480000 9435000 9450000 9470000 9480000 9485000 9415000 9425000 9460000 9470000 9480000 9485000 9490000 9440000
[85] 9475000 9495000 9435000 9440000 9495000 9500000 9505000 9415000 9445000 9475000 9505000 9510000 9415000 9430000
[99] 9445000 9470000 9515000 9515000 9520000 9450000 9490000 9505000 9525000 9410000 9525000 9530000 9420000 9425000
DBD2_21$region1 (start of genomic location of region 2)
[1] 9410000 9410000 9415000 9415000 9420000 9415000 9420000 9425000 9415000 9425000 9430000 9415000 9420000 9425000
[15] 9430000 9435000 9420000 9425000 9430000 9435000 9440000 9435000 9440000 9445000 9435000 9440000 9450000 9430000
[29] 9450000 9455000 9435000 9455000 9460000 9430000 9455000 9465000 9425000 9435000 9450000 9470000 9435000 9465000
[43] 9470000 9475000 9435000 9450000 9470000 9475000 9480000 9480000 9485000 9465000 9470000 9490000 9480000 9495000
[57] 9435000 9495000 9500000 9505000 9510000 9515000 9495000 9520000 9515000 9520000 9525000 9505000 9530000 9415000
[71] 9435000 9495000 9525000 9530000 9535000 9420000 9535000 9540000 9515000 9525000 9535000 9540000 9545000 9470000
[85] 9550000 9555000 9475000 9550000 9555000 9560000 9530000 9560000 9565000 9565000 9570000 9435000 9535000 9560000
[99] 9570000 9575000 9420000 9435000 9575000 9580000 9435000 9575000 9585000 9435000 9545000 9570000 9590000 9435000
((DBD1_21$region1[1:1118955] - DBD2_21$region1[1:1118955])) #0 means the genomic coordinates matched, any other number means the coordinates shifted by that much
[1] 0 0 0 0 0 0 0 0 -5000 -10000 -10000 10000 10000 -10000
[15] -10000 -10000 10000 10000 -10000 -10000 -10000 0 0 -10000 5000 5000 -35000 0
[29] -15000 -15000 10000 -5000 -40000 -5000 -25000 -30000 15000 15000 5000 -30000 15000 -10000
[43] -10000 -50000 0 5000 -10000 -10000 -65000 -55000 -55000 -30000 -30000 -40000 -25000 -35000
[57] 30000 -25000 -85000 -85000 -85000 -85000 -45000 -65000 -50000 -50000 -50000 -75000 -60000 60000
[71] 45000 -60000 -75000 -60000 -55000 65000 -120000 -115000 -55000 -55000 -55000 -55000 -55000 -30000
[85] -75000 -60000 -40000 -110000 -60000 -60000 -25000 -145000 -120000 -90000 -65000 75000 -120000 -130000
[99] -125000 -105000 95000 80000 -55000 -130000 55000 -70000 -60000 -25000 -20000 -40000 -170000 -10000
which( (DBD1_21$region1[1:1118955] - DBD2_21$region1[1:1118955])>0) #shows slots with differing genomic location
[1] 12 13 17 18 25 26 31 37 38 39 41 46 57 70 71 76 96 101 102 105 113 114 115 116 117 118
[27] 120 121 122 124 125 126 127 128 130 132 134 135 138 141 142 143 144 145 146 147 149 150 153 154 155 156d
[53] 157 158 159 160 162 163 164 166 167 168 169 173 174 177 179 180 182 185 188 192 193 195 500 510 554 581
[79] 616 642 651 682 697 700 701 702 703 704 720 721 726 727 728 729 736 737 738 739 740 741 750 751 752 754
Hello there,
I am working with micro-c data and used pre to make hic matrices. However, I am running into issues arising from the different number of dimensions for same resolution across my samples. My understanding is that since I am using hg19 chrom size files to partition to get the resolution, all the dimensions of my samples should be same as long as the resolution is the same. This is not true in my data. For example, at 50kb, chr3, chr5, and chr16 of my 2 biological replicates have differing dimensions. In chr3, there is 1 extra dimension in my rep 1. In chr 5, there is 4 extra dimensions in my rep 1. Chr 16 has 1 extra dimension in rep 1. Some extra dimensions/slots have 0s, some have 0s and non-zero values.
Then I checked for the genomic coordinates of the bins in txt file extracted with straw. I am finding that number of observations for each of my samples differ widely (looks like 0 bins were excluded with straw) and their coordinates don’t match.
Could someone please take a look and advice on this matter? Any help you can provide is greatly appreciated.
The following is detailed step-by-step to demonstrate the issue.
1. fastq files from Micro-c experiment was processed with dovetail genomics suggested steps (https://micro-c.readthedocs.io/en/latest/fastq_to_bam.html)
a. Exact script used to make .pairs file
#align to final valid pairs
bwa mem -5SP -T0 -t16 /gpfs0/export/reference/homo_sapiens/hg19/ucsc_assembly/illumina_download/Sequence/BWAIndex/genome.fa A673_Bio1_DBD_R1_comb.fastq.gz A673_Bio1_DBD_R2_comb.fastq.gz | \
pairtools parse --min-mapq 40 --walks-policy 5unique \
--max-inter-align-gap 30 --nproc-in 16 --nproc-out 16 --chroms-path /gpfs0/export/reference/homo_sapiens/hg19/ucsc_assembly/illumina_download/Sequence/WholeGenomeFasta/genome.fa | \
pairtools sort --tmpdir=/gpfs0/home1/gdlessnicklab/rsaxb003/temp/ --nproc 16 |
pairtools dedup --nproc-in 16 --nproc-out 16 --mark-dups --output-stats A673_Bio1_DBD_microc_stat.txt |
pairtools split --nproc-in 16 --nproc-out 16 --output-pairs A673_Bio1_DBD_mapped.pairs --output-sam -|samtools view -bS -@16 | \
samtools sort -@16 -o A673_Bio1_DBD_mapped.PT.bam;samtools index A673_Bio1_DBD_mapped.PT.bam
#qc
python3 /export/apps/opt/Micro-C/get_qc.py -p A673_Bio1_DBD_microc_stat.txt
2. Ran Juicer pre to create hic matrix as suggested by (https://micro-c.readthedocs.io/en/latest/contact_map.html#generating-hic-contact-maps-using-juicer-tools)
a. Exact code used to create hic matrix
java -Xmx48000m -Djava.awt.headless=true -jar /gpfs0/home1/gdlessnicklab/lab/software/Micro-C/juicer_tools_1.22.01.jar pre --threads 16 *_mapped.pairs A673_"$dir"_matrix.hic ../matrix/hg19.chrom.sizes.clean
b. When HicRep tool was used to calculate SCC (https://github.com/TaoYang-dev/hicrep), some chromosomes dimensions of the matrix don't match with the other replicate dimension.
See below:
#importing matrices from hic using hicrep hic2mat command
#Rep1 of DBD condition
DBD1_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
mat1 <-
hic2mat(
file = "/home/gdlessnicklab/lab/data/221013_Lessnick_GSL-AB-3073_A673_MicroC/Bio1_DBD/A673_Bio1_DBD_matrix.hic",
chromosome1 = i,
chromosome2 = i,
resol = 50000,
method = "KR"
)
DBD1_matrices[[i]] <- mat1
}
#Rep 2 pf DBD condition
DBD2_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
mat2 <-
hic2mat(
file = "/home/gdlessnicklab/lab/data/221013_Lessnick_GSL-AB-3073_A673_MicroC/Bio2_DBD/A673_Bio2_DBD_matrix.hic",
[79] 616 642 651 682 697 700 701 702 703 704 720 721 726 727 728 729 736 737 738 739 740 741 750 751 752 7541. fastq files from Micro-c experiment was processed with dovetail genomics suggested steps (https://micro-c.readthedocs.io/en/latest/fastq_to_bam.html)