I moved this issue from the Juicer GitHub tracker since it is not a bug and may be useful to others, and because I would still like to find a workaround: the underlying problem seems to be that an exceptionally large file is not being processed efficiently.
The file split0046 has 4,485,383 lines, four and a half times more than the others; that is why it is four and a half times bigger, and I imagine that is why it is taking disproportionately longer to finish. After Neva's last comment I understand that a split is closed after 1 million lines once there is a clear non-duplicate, so split0046 is correctly split.
Is there, then, a way to optimize the awk -f ../juicer/scripts/dups.awk run? It is a single-threaded process; would more RAM or something else help?
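One workaround I am considering (a sketch only, not something juicer.sh does for you): since the merged file is position-sorted and Juicer only splits at a clear non-duplicate boundary, an oversized split could itself be re-split at such boundaries and the pieces deduplicated in parallel. The awk below is a deliberately simplified stand-in for dups.awk (it drops only exact repeats of the first field), and the field numbers and file names are made up for the demo:

```shell
#!/bin/sh
set -e

# Toy position-sorted input: field 1 stands in for the mapping
# position that duplicate reads share.
cat > big_split.txt <<'EOF'
100 readA
100 readB
205 readC
310 readD
310 readE
310 readF
420 readG
EOF

# Re-split into chunks of roughly "max" lines, cutting only where the
# position changes, so no duplicate group straddles a chunk boundary.
awk -v max=3 'BEGIN { chunk = 0 }
    NR > 1 && $1 != prev && n >= max { chunk++; n = 0 }
    { print > ("chunk_" chunk ".txt"); n++; prev = $1 }
' big_split.txt

# Dedup each chunk in parallel; this awk is a stand-in for dups.awk
# (it keeps only the first read of each run of identical positions).
for f in chunk_*.txt; do
    awk '$1 != last { print } { last = $1 }' "$f" > "$f.dedup" &
done
wait

cat chunk_*.txt.dedup > big_split.dedup
cat big_split.dedup
```

Concatenating the per-chunk results is only safe if chunk boundaries truly separate duplicate groups; with Juicer's near-duplicate criterion, a safe boundary would need a position gap larger than the fuzz window, which is an assumption to verify before relying on this.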
=== https://github.com/aidenlab/juicer/issues/140 ===
[nchernia] Oct 29, 2019, 2:53 AM GMT+1
You probably have a lot of duplicates in that file. You can use the flag “-j” with the latest version of Juicer - this will eliminate only exact duplicates instead of near duplicates. Not recommended overall but sometimes necessary. You can start at the “dedup” stage with the -j flag. Indeed, this is not a bug and is better discussed on the forum.
-- Neva Cherniavsky Durand, Ph.D. Assistant Professor, Aiden Lab www.aidenlab.org
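For the record, the restart Neva describes would look something like the line below; -S dedup and -j come from her comment, while the remaining options are placeholders that must match whatever the original run used (a sketch, not a verified invocation):

```shell
# Resume the existing run at the dedup stage, removing exact
# duplicates only (-j), per the maintainer's suggestion.
# -d, -z, -g, -t are placeholders for the original run's options.
bash juicer/scripts/juicer.sh -S dedup -j -t 32 \
    -d /path/to/topDir -z /path/to/reference.fa -g genomeID
```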
=== https://github.com/aidenlab/juicer/issues/140 ===
[you] Oct 29, 2019, 11:28 AM GMT+1
Dear devels,
I'm trying to isolate the source of the issue, but it still escapes me.
Maybe this is a bug, which is why I'm writing here (sorry if that's not
the case).
I'm using Juicer/3D-DNA under SLURM. In my run, Juicer generates split files (split0000 to split0186) in the ../aligned directory.
The file sizes are mostly between 303M and 312M, with two exceptions:
split0186 at 34M and split0046 at 1.4G.
The file split0046 is dramatically slowing the Juicer pipeline: the following command has been running for more than 5 days:
awk -f ../juicer/scripts/dups.awk -v name=../aligned/a1571832472_msplit0046_ ../aligned/split0046
The split0046 file has 4,485,383 reads, while split0186 has 107,915; the other split files have roughly 1,000,000 reads each.
With the exception of the parameters -S early and -t 32, everything is at its defaults. I only modified a few lines of the juicer.sh script to adapt it to our infrastructure (e.g., SBATCH arguments such as --qos).
Could you point me to how to modify the code to avoid generating such a large file?
Thanks!
--
Posted to the "3D Genomics" Google Group: https://groups.google.com/d/msgid/3d-genomics/fb531b4a-e25f-4544-85f3-3ef0a853e569%40googlegroups.com