"chr11" error during creation of HIC file


Longzhi Tan

Oct 7, 2016, 11:15:16 PM
to 3D Genomics

Hi Aiden lab,


Thanks a lot for your wonderful experimental help last year! I'm trying out the Juicer software, but encountered the following problem:


I successfully generated a merged_sort.txt file for my mouse data. However, generating the HIC file failed with the following error message after running for a while (I assume it read through chr1 and chr10 successfully but somehow got stuck at chr11):


======
Start preprocess
Writing header
Writing body
java.lang.NumberFormatException: For input string: "chr11"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:492)
        at java.lang.Integer.parseInt(Integer.java:527)
        at juicebox.tools.utils.original.AsciiPairIterator.advance(AsciiPairIterator.java:145)
        at juicebox.tools.utils.original.AsciiPairIterator.next(AsciiPairIterator.java:187)
        at juicebox.tools.utils.original.Preprocessor.computeWholeGenomeMatrix(Preprocessor.java:433)
        at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:311)
        at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:223)
        at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:98)
        at juicebox.tools.HiCTools.main(HiCTools.java:77)
======


I was wondering what I can do to avoid this error. I did have to remove "-Xgcthreads1" from juicebox48g because our cluster doesn't support that option, and I hope this modification is not the cause.


Thanks a lot in advance!!


Best,

Tan

Longzhi Tan

Oct 7, 2016, 11:26:47 PM
to 3D Genomics
Following advice from Neva and Muhammad, I used awk to find lines with an abnormal number of fields. Indeed, I found a few lines (out of millions) that don't have 16 fields, listed here as line number:field count:
27763310:26
39934289:26
41542982:30
41548842:29
56466077:26
57856984:29
92201268:29
97600660:29
97772514:24
97784425:19
97813710:31
97837619:31
97849222:28
129806820:31
147472759:26
147575823:28
147729463:29
162187733:21
162199478:26
162208097:31
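
The field-count check described above was done with awk; the exact command was not posted. A minimal Python sketch of the same idea follows, assuming fields are separated by whitespace and that 16 is the expected count for this file. The function name `abnormal_lines` is made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def abnormal_lines(lines: Iterable[str], expected: int = 16) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, field_count) for every line whose number of
    whitespace-separated fields differs from `expected`.

    Line numbers are 1-based. An empty line has 0 fields and is reported too.
    """
    for number, line in enumerate(lines, start=1):
        count = len(line.split())
        if count != expected:
            yield number, count


# Small in-memory demo; for a real file, pass an open file handle instead.
demo = [
    " ".join(["f"] * 16),  # well formed
    " ".join(["f"] * 26),  # two records fused together
]
for number, count in abnormal_lines(demo):
    print(f"{number}:{count}")
```

Because the function iterates lazily, it can be handed an open file object for a file with millions of lines without loading it into memory.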

For example, line 27763310 is:

0 chr11 67186231 159176 16 chr11 67186446 159178 60 100M CATAC0 chr11 81174809 194924 16 chr11 87515044 210731 60 100M CTCAGGCTAGAGACCCTCTAGGATGTGTGTTTCCCTAAGGACGTGTGCATCTGAGTTGGTCATTAACGTCAAACTAGAACAGCCACAGCCTCATCTTCAG 60 100M AAATGCAGACATCTCTTAGCAAGGTCCTCTTGGTTTAGGAGATATGGCTGCTATGATGGAATTCTAAGATAGCTATTCTCTCTAAACCATTCCAAACCAA ILLUMINA-D00365:497:H3VC2BCXX:2:2103:5197:99782/1 ILLUMINA-D00365:497:H3VC2BCXX:2:2103:5197:99782/2


And line 39934289 is:

16 chr12 3109979 303 16 chr2 98662500 235918 0 100M TCACGGAAAATGAGAAATACACAACTT16 chr12 3109898 303 16 chr2 98667206 235923 10 68S32M ATTTCATAATTTTTCAATTCGTCAATTGGATGTTTCTCATTTTCCAATTGACGAATTGAAAAATTATGAAATCACTGAAAATCACGGAAAATGAGAAATA 7 5M1D95M ACGTGAAAATGAAAAATGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCACGGAAAATGAGAAATACACACTTTAGGACGTGAAATATGG ILLUMINA-D00365:497:H3VC2BCXX:1:2208:18610:68121/2 ILLUMINA-D00365:497:H3VC2BCXX:1:2208:18610:68121/1 


No wonder these lines crashed juicebox tools. I was wondering what caused these abnormal lines?


Best,

Tan

Neva Durand

Oct 8, 2016, 2:17:56 AM
to Longzhi Tan, 3D Genomics
Hi Tan,

It looks like something went wrong during the merge.

I would probably just throw out the problematic lines for now, if it's only a few out of millions of reads.
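
As a rough sketch of throwing out the problematic lines, the snippet below copies a file and skips any line that does not have the expected number of whitespace-separated fields. The function name `filter_file` and the file paths are hypothetical, and the check looks only at the field count, so a corrupted line that happens to have 16 fields would still pass.

```python
def filter_file(src_path: str, dst_path: str, expected: int = 16) -> int:
    """Copy src_path to dst_path, keeping only lines with `expected`
    whitespace-separated fields. Returns the number of dropped lines."""
    dropped = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if len(line.split()) == expected:
                dst.write(line)  # keep the line unchanged, newline included
            else:
                dropped += 1
    return dropped
```

A call such as `filter_file("merged_sort.txt", "merged_sort.clean.txt")` (the output name is an assumption) preserves the input order, so the filtered file stays sorted if the input was.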

Best
Neva

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/b14a0216-e0e5-4c5a-b6b5-c3affbf0ca69%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Neva Cherniavsky Durand, Ph.D.
Staff Scientist, Aiden Lab

Alina Saiakhova

Jan 20, 2017, 6:33:56 PM
to 3D Genomics
Hello, 

I'm having a similar issue converting a fragment pairs file into .hic format. I've sorted the contacts file in chromosome order as described here and verified that no rows have an inconsistent number of fields, but I'm still getting this error with both juicebox_tools.7.0.jar and juicebox_tools.7.5.jar:

java.lang.NumberFormatException: For input string: "X"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        ...


Here are the first 3 lines of my file:

100000 0 1 24625417 5265 0 1 35182373 7603
1000000 16 1 172050001 41515 16 1 172066388 41519
1000001 16 1 172050001 41515 16 1 28437257 6096


The rows containing chrX look like this:

113746309 0 X 2701562 785525 0 X 11373483 787923
113746310 0 X 2701562 785525 0 X 107449382 816032
113746311 0 X 2701562 785525 0 X 113098887 817831


There are always 9 fields per row; I've checked. I have tried removing the chrX and chrY rows, and the program seems to make it past the point where it would generate the error, but I would really like to include chrX interactions in the final output if possible. I have also tried taking just the top 10 chr1 pairs and the top 10 chrX pairs to make a small test file (the original file has 630 million contacts), and the program generates the test .hic file just fine.

I was hoping you could help me troubleshoot this issue or suggest a workaround. 

Thank you so much, 

Alina


Neva Durand

Jan 21, 2017, 1:15:36 AM
to Alina Saiakhova, 3D Genomics
Hello, 

This isn't the proper 9 field format.  It looks like you have the read name as the first field, so this should be an 11 field format.  See this link for more information:

You can add in dummy values for mapq (e.g., 100).

I'm not sure this will work as the other things you describe are strange, but try it in any event, since right now Juicebox is taking your last field as the bin score (not the behavior you want).
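
A minimal sketch of the dummy-mapq suggestion, assuming the 11-field layout is the nine existing columns (read name first) followed by two mapq columns; that column order is an assumption here, since the linked format description is not reproduced in this thread. The function name `add_dummy_mapq` is hypothetical.

```python
def add_dummy_mapq(line: str, mapq: int = 100) -> str:
    """Append two placeholder mapq values to a 9-field row whose first
    field is the read name, giving an 11-field row.

    Assumption: the two mapq columns go at the end of the row.
    """
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return " ".join(fields + [str(mapq), str(mapq)])


row = "113746309 0 X 2701562 785525 0 X 11373483 787923"
print(add_dummy_mapq(row))
```

The length check fails loudly on rows that are already in another layout rather than silently padding them.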

Best
Neva


Alina Saiakhova

Jan 23, 2017, 4:11:54 PM
to 3D Genomics, alish...@gmail.com
I removed the read name column from the fragment pairs file and was able to successfully generate the .hic file. Thank you so much for helping me figure this out!
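
The command used to remove the column was not posted. One way to do it, sketched in Python under the assumption that the read name is always the first of nine whitespace-separated fields (the function name `drop_read_name` is made up for illustration):

```python
def drop_read_name(line: str) -> str:
    """Remove the leading read-name field from a 9-field row, leaving
    the remaining eight fields in their original order."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return " ".join(fields[1:])


row = "100000 0 1 24625417 5265 0 1 35182373 7603"
print(drop_read_name(row))
```

Applied to every row, this leaves the strand, chromosome, position, and fragment columns for each read end and nothing else.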

Alina