Unsorted read positions detected when creating .hic file with juicer and juicer_tools.jar

169 views
Skip to first unread message

Koustav Pal

unread,
May 24, 2020, 8:09:38 AM5/24/20
to 3D Genomics
Hi,

First off, the link to the forum which can be found here is broken https://github.com/aidenlab/juicer/issues/64

Next, please find below the description with relation to the subject of this issue.

I am using juicer commit f866383. However my juicer.sh is from an older release before the changes related to Arima.

My juicer_tools.jar version is 1.21.01.

During my juicer run, the pipeline failed twice.

First time, the pipeline exited with the error message

"***! Error! The statistics do not add up. Alignment likely failed to complete on one or more files.  Run relaunch_prep.sh"

Relaunch_prep.sh directed me to relaunch the pipeline with the flag -S merge.

This time all files were created except for the .hic files, and the pipeline exited with the same error message.

"***! Error! The statistics do not add up. Alignment likely failed to complete on one or more files.  Run relaunch_prep.sh"


This time, I narrowed down the issue to an incorrectly generated output from chimeric_blacklist.awk. For two files, the chimeric_blacklist.awk
script printed only the first column values without a trailing newline, leading to two consecutive norm.txt.res.txt files to
appear on the same line together. This lead to the pipeline exiting with this error message.

Therefore for two of these files, I executed these commands manually from the juicer.sh script

awk -v "fname1"="splits/SRR6117996080.fastq_norm.txt" -v "fname2"="splits/SRR6117996080.fastq_abnorm.sam" -v "fname3"="splits/SRR6117996080.fastq_unmapped.sam" -f Juicer_processing_scripts/scripts/chimeric_blacklist.awk splits/SRR6117996080.fastq.sam
site_file="Juicer_processing_scripts/restriction_sites/mm10noScaffoldsChrMChrY_HindIII.txt"
Juicer_processing_scripts/scripts/fragment.pl splits/SRR6117996077.fastq_norm.txt splits/SRR6117996077.fastq.frag.txt $site_file
sort -S 2G -T $tmpdir -k2,2d -k6,6d -k4,4n -k8,8n -k1,1n -k5,5n -k3,3n splits/SRR6117996077.fastq.frag.txt > splits/SRR6117996077.fastq.sort.txt
rm splits/SRR6117996077.fastq_norm.txt splits/SRR6117996077.fastq.frag.txt

Then I re-executed the pipeline with the flag -S merge. The pipeline failed with the error message asking me to download the jar file.

So, I downloaded the juicer_tools.jar for the .hic file generation.

Finally, when running the pipeline with the flag -S final, I received the error message

"Error: the chromosome combination 1_9 appears in multiple blocks"

Koustav Pal

unread,
May 24, 2020, 8:31:31 AM5/24/20
to 3D Genomics
I can also confirm that the locale of my environment is set to en_US.UTF-8 by default

Neva Durand

unread,
May 24, 2020, 9:02:19 AM5/24/20
to Koustav Pal, 3D Genomics
Thanks for the note about the broken link.

I wouldn't ignore the error about statistics. If the stats files are corrupt, it usually indicates that something has gone wrong with alignment.

Since you have the "norm.txt.res.txt" files, I would run the "stats_sub.awk" script on them and check that the numbers make sense:

awk -f $juiceDir/scripts/stats_sub.awk splits/*norm.txt.res.txt

In particular, make sure that the number of reads corresponds to the actual number of reads in your fastq files. The files "*linecount.txt" have the results of running wc -l on the fastqs, so that total divided by 4 should make the "Sequenced Reads" number reported by the stats_sub script.

Let me know how this goes.

Given that you executed a sort on the pair you determined to be corrupt, it is strange that chromosome blocks wouldn't be sorted, indicating that one of your other pairs probably also didn't complete.

Best
Neva

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/201178fb-7a54-4d79-8109-17afab286a91%40googlegroups.com.


--
Neva Cherniavsky Durand, Ph.D.
Pronouns: she, her, hers
Assistant Professor, Aiden Lab

Koustav Pal

unread,
May 24, 2020, 5:22:43 PM5/24/20
to Neva Durand, 3D Genomics
Hi Neva,

I executed the awk command as outline and the number of reads match up.

I also checked the number of *.sort.txt files. These correspond to the number of split fastqs in the splits directory.

Might this problem be related to the initial error from the the chimeric_blacklist.awk script, where it printed the first column and not the later ones for the two files which I identified?

I also identified that chimeric_blacklist.awk appends to a file as opposed to writing to a file.

Therefore, I checked to see if the number of unique read pairs in the two sort.txt files are the same as the total number of read pairs in the files.

Number of unique read pairs = awk '{print $(NF-1),$(NF)}' splits/SRR6117996077.fastq.sort.txt | sort | uniq | wc -l
Number of read pairs in the file = wc -l splits/SRR6117996080.fastq.sort.txt

They are the same in both files, so this is also not a case of two contiguous chunks being repeated in the sorted chunks.

Cheers,
Koustav

Neva Durand

unread,
May 25, 2020, 10:54:46 AM5/25/20
to Koustav Pal, 3D Genomics
Yes, it appends because the count ligations script first puts the ligations in the first position. This file is always overwritten when the pipeline is run (though I can't speak to your version of juicer.sh ).

If you're confident that all the reads are in the *.sort.txt files, you can run a sort without the "-m" flag, though this will take a while. Or you could redo the sort on each one to be sure, and then run with the -m flag to combine into a merged_sort.txt file. I'm concerned about the chromosome block sorted error; this shouldn't happen if the files were properly sorted, and if they weren't, you might miss duplicates.If you aren't too worried about that, just chromosome-block sort your merged_nodups file and then create the hic file as usual. You can chromosome block sort via "sort -k2,2d -k6,6d"

Koustav Pal

unread,
May 25, 2020, 2:25:46 PM5/25/20
to Neva Durand, 3D Genomics
Hi Neva, 

Thanks for the suggestion. Earlier today I resorted all the chunks and re started the pipeline with the merge flag.
I have now gotten the .hic files. 

Thanks.
Koustav 
Reply all
Reply to author
Forward
0 new messages