hic matrices of same resolution have differing dimensions

36 views
Skip to first unread message

Ariunaa Bayanjargal

unread,
Apr 15, 2023, 8:52:07 PM4/15/23
to 3D Genomics

Hello there,

I am working with micro-c data. I was able to create hic matrix with juicer pre command. However, I am running into issues arising from the different number of dimensions for same resolution across my samples. My understanding is that since I am using hg19 chrom size files to partition to get the resolution, all the dimensions of my samples should be same as long as the resolution is the same. This is not true in my data. For example, at 50kb, chr3, chr5, and chr16 of my 2 biological replicates have differing dimensions. In chr3, there is 1 extra dimension in my rep 1. In chr 5, there is 4 extra dimensions in my rep 1. Chr 16 has 1 extra dimension in rep 1. Some extra dimensions/slots have 0s, some have 0s and non-zero values.

Then I checked for the genomic coordinates of the bins in txt file extracted with straw. I am finding that number of observations for each of my samples differ widely (looks like 0 bins were excluded with straw) and their coordinates don’t match.

Could someone please take a look and advice on this matter? Any help you can provide is greatly appreciated. 

The following is the exact codes were used and some helpful R output to demonstrate this issue. 

1. fastq files from Micro-c experiment was processed with dovetail genomics suggested steps (https://micro-c.readthedocs.io/en/latest/fastq_to_bam.html)

a. Exact script used to make .pairs file

#align to final valid pairs
bwa mem -5SP -T0 -t16 A673_Bio1_DBD_R1_comb.fastq.gz A673_Bio1_DBD_R2_comb.fastq.gz | \
pairtools parse --min-mapq 40 --walks-policy 5unique \
--max-inter-align-gap 30 --nproc-in 16 --nproc-out 16 --chroms-path /gpfs0/export/reference/homo_sapiens/hg19/ucsc_assembly/illumina_download/Sequence/WholeGenomeFasta/genome.fa | \
pairtools sort --tmpdir=/gpfs0/home1/temp/ --nproc 16 |
pairtools dedup --nproc-in 16 --nproc-out 16 --mark-dups --output-stats A673_Bio1_DBD_microc_stat.txt |
pairtools split --nproc-in 16 --nproc-out 16 --output-pairs A673_Bio1_DBD_mapped.pairs --output-sam -|samtools view -bS -@16 | \
samtools sort -@16 -o A673_Bio1_DBD_mapped.PT.bam;samtools index A673_Bio1_DBD_mapped.PT.bam
#qc
python3 /export/apps/opt/Micro-C/get_qc.py -p A673_Bio1_DBD_microc_stat.txt

2. Ran Juicer pre to create hic matrix as suggested by (https://micro-c.readthedocs.io/en/latest/contact_map.html#generating-hic-contact-maps-using-juicer-tools)

a. Exact code used to create hic matrix
java -Xmx48000m -Djava.awt.headless=true -jar /juicer_tools_1.22.01.jar pre --threads 16 *_mapped.pairs A673_"$dir"_matrix.hic ../matrix/hg19.chrom.sizes.clean

b. When HicRep tool was used to calculate SCC (https://github.com/TaoYang-dev/hicrep), some chromosomes dimensions of the matrix don't match with the other replicate dimension.
See below:

#importing matrices from hic using hicrep hic2mat command
#Rep1 of DBD condition
DBD1_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
  mat1 <-
    hic2mat(
        file = "/Bio1_DBD/A673_Bio1_DBD_matrix.hic",
      chromosome1 = i,
      chromosome2 = i,
      resol = 50000,
      method = "KR"
    )
  DBD1_matrices[[i]] <- mat1
}

#Rep 2 pf DBD condition
DBD2_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
  mat2 <-
    hic2mat(
      file = "/Bio2_DBD/A673_Bio2_DBD_matrix.hic",
      chromosome1 = i,
      chromosome2 = i,
      resol = 50000,
      method = "KR"
    )
  DBD2_matrices[[i]] <- mat2
}

#dimensions of chr3 in both replicates
dim(DBD1_matrices[["chr3"]])
[1] 3960 3960

dim(DBD2_matrices[["chr3"]])
[1] 3959 3959

3. Used straw to 3 columned text file

a. straw NONE ../Bio1_DBDp/A673_Bio1_DBDp_matrix.hic "$chr" "$chr" BP 5000 > 5kb/DBDp1/A673_DBDp1_"$chr"_NONE_5kb.txt

b. After improting the txt file into R, DBD1 has 1,855,898 observartions versus DBD2 has 1,118,955
See below for the mismatching genomic coordinates of the bins.

DBD1_21$region1 (start of genomic location of region 1)
   [1]  9410000  9410000  9415000  9415000  9420000  9415000  9420000  9425000  9410000  9415000  9420000  9425000  9430000  9415000
  [15]  9420000  9425000  9430000  9435000  9420000  9425000  9430000  9435000  9440000  9435000  9440000  9445000  9415000  9430000
  [29]  9435000  9440000  9445000  9450000  9420000  9425000  9430000  9435000  9440000  9450000  9455000  9440000  9450000  9455000
  [43]  9460000  9425000  9435000  9455000  9460000  9465000  9415000  9425000  9430000  9435000  9440000  9450000  9455000  9460000
  [57]  9465000  9470000  9415000  9420000  9425000  9430000  9450000  9455000  9465000  9470000  9475000  9430000  9470000  9475000
  [71]  9480000  9435000  9450000  9470000  9480000  9485000  9415000  9425000  9460000  9470000  9480000  9485000  9490000  9440000
  [85]  9475000  9495000  9435000  9440000  9495000  9500000  9505000  9415000  9445000  9475000  9505000  9510000  9415000  9430000
  [99]  9445000  9470000  9515000  9515000  9520000  9450000  9490000  9505000  9525000  9410000  9525000  9530000  9420000  9425000

DBD2_21$region1 (start of genomic location of region 2)
   [1]  9410000  9410000  9415000  9415000  9420000  9415000  9420000  9425000  9415000  9425000  9430000  9415000  9420000  9425000
  [15]  9430000  9435000  9420000  9425000  9430000  9435000  9440000  9435000  9440000  9445000  9435000  9440000  9450000  9430000
  [29]  9450000  9455000  9435000  9455000  9460000  9430000  9455000  9465000  9425000  9435000  9450000  9470000  9435000  9465000
  [43]  9470000  9475000  9435000  9450000  9470000  9475000  9480000  9480000  9485000  9465000  9470000  9490000  9480000  9495000
  [57]  9435000  9495000  9500000  9505000  9510000  9515000  9495000  9520000  9515000  9520000  9525000  9505000  9530000  9415000
  [71]  9435000  9495000  9525000  9530000  9535000  9420000  9535000  9540000  9515000  9525000  9535000  9540000  9545000  9470000
  [85]  9550000  9555000  9475000  9550000  9555000  9560000  9530000  9560000  9565000  9565000  9570000  9435000  9535000  9560000
  [99]  9570000  9575000  9420000  9435000  9575000  9580000  9435000  9575000  9585000  9435000  9545000  9570000  9590000  9435000

((DBD1_21$region1[1:1118955] - DBD2_21$region1[1:1118955]))  #0 means the genomic coordinates matched, any other number means the coordinates shifted by that much
   [1]        0        0        0        0        0        0        0        0    -5000   -10000   -10000    10000    10000   -10000
  [15]   -10000   -10000    10000    10000   -10000   -10000   -10000        0        0   -10000     5000     5000   -35000        0
  [29]   -15000   -15000    10000    -5000   -40000    -5000   -25000   -30000    15000    15000     5000   -30000    15000   -10000
  [43]   -10000   -50000        0     5000   -10000   -10000   -65000   -55000   -55000   -30000   -30000   -40000   -25000   -35000
  [57]    30000   -25000   -85000   -85000   -85000   -85000   -45000   -65000   -50000   -50000   -50000   -75000   -60000    60000
  [71]    45000   -60000   -75000   -60000   -55000    65000  -120000  -115000   -55000   -55000   -55000   -55000   -55000   -30000
  [85]   -75000   -60000   -40000  -110000   -60000   -60000   -25000  -145000  -120000   -90000   -65000    75000  -120000  -130000
  [99]  -125000  -105000    95000    80000   -55000  -130000    55000   -70000   -60000   -25000   -20000   -40000  -170000   -10000

which( (DBD1_21$region1[1:1118955] - DBD2_21$region1[1:1118955])>0) #shows slots with differing genomic location
   [1]   12   13   17   18   25   26   31   37   38   39   41   46   57   70   71   76   96  101  102  105  113  114  115  116  117  118
  [27]  120  121  122  124  125  126  127  128  130  132  134  135  138  141  142  143  144  145  146  147  149  150  153  154  155  156d
  [53]  157  158  159  160  162  163  164  166  167  168  169  173  174  177  179  180  182  185  188  192  193  195  500  510  554  581
  [79]  616  642  651  682  697  700  701  702  703  704  720  721  726  727  728  729  736  737  738  739  740  741  750  751  752  754

Ariunaa Bayanjargal

unread,
Apr 15, 2023, 8:54:01 PM4/15/23
to 3D Genomics

Hello there,

I am working with micro-c data and used pre to make hic matrices. However, I am running into issues arising from the different number of dimensions for same resolution across my samples. My understanding is that since I am using hg19 chrom size files to partition to get the resolution, all the dimensions of my samples should be same as long as the resolution is the same. This is not true in my data. For example, at 50kb, chr3, chr5, and chr16 of my 2 biological replicates have differing dimensions. In chr3, there is 1 extra dimension in my rep 1. In chr 5, there is 4 extra dimensions in my rep 1. Chr 16 has 1 extra dimension in rep 1. Some extra dimensions/slots have 0s, some have 0s and non-zero values.

Then I checked for the genomic coordinates of the bins in txt file extracted with straw. I am finding that number of observations for each of my samples differ widely (looks like 0 bins were excluded with straw) and their coordinates don’t match.

Could someone please take a look and advice on this matter? Any help you can provide is greatly appreciated.

The following is detailed step-by-step to demonstrate the issue. 

1. fastq files from Micro-c experiment was processed with dovetail genomics suggested steps (https://micro-c.readthedocs.io/en/latest/fastq_to_bam.html)

a. Exact script used to make .pairs file

#align to final valid pairs

bwa mem -5SP -T0 -t16 /gpfs0/export/reference/homo_sapiens/hg19/ucsc_assembly/illumina_download/Sequence/BWAIndex/genome.fa A673_Bio1_DBD_R1_comb.fastq.gz A673_Bio1_DBD_R2_comb.fastq.gz | \


pairtools parse --min-mapq 40 --walks-policy 5unique \
--max-inter-align-gap 30 --nproc-in 16 --nproc-out 16 --chroms-path /gpfs0/export/reference/homo_sapiens/hg19/ucsc_assembly/illumina_download/Sequence/WholeGenomeFasta/genome.fa | \

pairtools sort --tmpdir=/gpfs0/home1/gdlessnicklab/rsaxb003/temp/ --nproc 16 |


pairtools dedup --nproc-in 16 --nproc-out 16 --mark-dups --output-stats A673_Bio1_DBD_microc_stat.txt |
pairtools split --nproc-in 16 --nproc-out 16 --output-pairs A673_Bio1_DBD_mapped.pairs --output-sam -|samtools view -bS -@16 | \
samtools sort -@16 -o A673_Bio1_DBD_mapped.PT.bam;samtools index A673_Bio1_DBD_mapped.PT.bam
#qc
python3 /export/apps/opt/Micro-C/get_qc.py -p A673_Bio1_DBD_microc_stat.txt

2. Ran Juicer pre to create hic matrix as suggested by (https://micro-c.readthedocs.io/en/latest/contact_map.html#generating-hic-contact-maps-using-juicer-tools)

a. Exact code used to create hic matrix

java -Xmx48000m -Djava.awt.headless=true -jar /gpfs0/home1/gdlessnicklab/lab/software/Micro-C/juicer_tools_1.22.01.jar pre --threads 16 *_mapped.pairs A673_"$dir"_matrix.hic ../matrix/hg19.chrom.sizes.clean



b. When HicRep tool was used to calculate SCC (https://github.com/TaoYang-dev/hicrep), some chromosomes dimensions of the matrix don't match with the other replicate dimension.
See below:

#importing matrices from hic using hicrep hic2mat command
#Rep1 of DBD condition
DBD1_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
  mat1 <-
    hic2mat(

        file = "/home/gdlessnicklab/lab/data/221013_Lessnick_GSL-AB-3073_A673_MicroC/Bio1_DBD/A673_Bio1_DBD_matrix.hic",


      chromosome1 = i,
      chromosome2 = i,
      resol = 50000,
      method = "KR"
    )
  DBD1_matrices[[i]] <- mat1
}

#Rep 2 pf DBD condition
DBD2_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
  mat2 <-
    hic2mat(

      file = "/home/gdlessnicklab/lab/data/221013_Lessnick_GSL-AB-3073_A673_MicroC/Bio2_DBD/A673_Bio2_DBD_matrix.hic",

  [79]  616  642  651  682  697  700  701  702  703  704  720  721  726  727  728  729  736  737  738  739  740  741  750  751  752  7541. fastq files from Micro-c experiment was processed with dovetail genomics suggested steps (https://micro-c.readthedocs.io/en/latest/fastq_to_bam.html)


a. Exact script used to make .pairs file

#align to final valid pairs
bwa mem -5SP -T0 -t16 /gpfs0/export/reference/homo_sapiens/hg19/ucsc_assembly/illumina_download/Sequence/BWAIndex/genome.fa A673_Bio1_DBD_R1_comb.fastq.gz A673_Bio1_DBD_R2_comb.fastq.gz | \

pairtools parse --min-mapq 40 --walks-policy 5unique \
--max-inter-align-gap 30 --nproc-in 16 --nproc-out 16 --chroms-path /gpfs0/export/reference/homo_sapiens/hg19/ucsc_assembly/illumina_download/Sequence/WholeGenomeFasta/genome.fa | \
pairtools sort --tmpdir=/gpfs0/home1/gdlessnicklab/rsaxb003/temp/ --nproc 16 |

pairtools dedup --nproc-in 16 --nproc-out 16 --mark-dups --output-stats A673_Bio1_DBD_microc_stat.txt |
pairtools split --nproc-in 16 --nproc-out 16 --output-pairs A673_Bio1_DBD_mapped.pairs --output-sam -|samtools view -bS -@16 | \
samtools sort -@16 -o A673_Bio1_DBD_mapped.PT.bam;samtools index A673_Bio1_DBD_mapped.PT.bam
#qc
python3 /export/apps/opt/Micro-C/get_qc.py -p A673_Bio1_DBD_microc_stat.txt

2. Ran Juicer pre to create hic matrix as suggested by (https://micro-c.readthedocs.io/en/latest/contact_map.html#generating-hic-contact-maps-using-juicer-tools)

a. Exact code used to create hic matrix
java -Xmx48000m -Djava.awt.headless=true -jar /gpfs0/home1/gdlessnicklab/lab/software/Micro-C/juicer_tools_1.22.01.jar pre --threads 16 *_mapped.pairs A673_"$dir"_matrix.hic ../matrix/hg19.chrom.sizes.clean


b. When HicRep tool was used to calculate SCC (https://github.com/TaoYang-dev/hicrep), some chromosomes dimensions of the matrix don't match with the other replicate dimension.
See below:

#importing matrices from hic using hicrep hic2mat command
#Rep1 of DBD condition
DBD1_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
  mat1 <-
    hic2mat(
        file = "/home/gdlessnicklab/lab/data/221013_Lessnick_GSL-AB-3073_A673_MicroC/Bio1_DBD/A673_Bio1_DBD_matrix.hic",

      chromosome1 = i,
      chromosome2 = i,
      resol = 50000,
      method = "KR"
    )
  DBD1_matrices[[i]] <- mat1
}

#Rep 2 pf DBD condition
DBD2_matrices <- list()
for (i in paste0("chr", c(as.character(1:22), "X"))) {
  mat2 <-
    hic2mat(
      file = "/home/gdlessnicklab/lab/data/221013_Lessnick_GSL-AB-3073_A673_MicroC/Bio2_DBD/A673_Bio2_DBD_matrix.hic",
Reply all
Reply to author
Forward
0 new messages