Abnormally large split file causes a heavy delay in the pipeline


D. N. De Panis

Oct 29, 2019, 12:17:31 PM
to 3D Genomics

I moved this issue from the Juicer GitHub since it is not a bug and it may be useful to others, and because I would still like to find a workaround (the main issue seems to be that an exceptionally large file is not being processed efficiently).


The file split0046 has 4,485,383 lines, about four and a half times more than the others; that's why it is four and a half times bigger, and I imagine that's why it is taking disproportionately longer to complete the task. After Neva's last comment I understand that a split is made after 1 million lines only once there is a clear non-duplicate, so split0046 is correctly split.


Is there a way to optimize the awk -f ../juicer/scripts/dups.awk run? It's a single-threaded process; would more RAM or something similar help?


One very curious thing: I just checked a previous Juicer run (a couple of weeks ago) against an earlier version of the genome I'm using. It also produced a large split file with the same number of lines (4,485,383), but it finished the dups.awk task in 48 hours. Nothing changed in the cluster, of course.

Thanks again!

D.




=== https://github.com/aidenlab/juicer/issues/140 ===

[nchernia] Oct 29, 2019, 4:35 AM GMT+1

It splits after 1,000,000 lines, not at exactly 1,000,000: there needs to be a clear non-duplicate for it to split.
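As a toy illustration of that rule (hypothetical awk, not Juicer's actual splitting code; the three-column layout, the position field, and the tiny threshold are assumptions for the demo), a chunk boundary is emitted only once the line count passes the threshold AND the current record starts at a new position, so duplicate candidates are never separated across chunks:

```shell
# Build a tiny sorted input: position is field 3; a run of duplicates
# straddles the threshold, forcing the chunk to grow past it.
cat > demo_input.txt <<'EOF'
chr1 0 100
chr1 0 100
chr1 0 101
chr1 0 101
chr1 0 101
chr1 0 900
EOF

awk -v threshold=3 '
  BEGIN { chunk = 0 }
  # Split only when past the threshold AND at a clear non-duplicate
  # (current position differs from the previous record).
  NR > 1 && lines >= threshold && $3 != prevpos {
      chunk++
      lines = 0
  }
  {
      print > ("demo_split" chunk)
      lines++
      prevpos = $3
  }
' demo_input.txt
```

With this input the first chunk (demo_split0) ends up with 5 lines instead of 3, because the run of duplicates at position 101 cannot be cut, which is exactly why one real split file can grow far beyond 1 million lines.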




=== https://github.com/aidenlab/juicer/issues/140 ===

[nchernia] Oct 29, 2019, 2:53 AM GMT+1


You probably have a lot of duplicates in that file. With the latest version of Juicer you can use the "-j" flag, which eliminates only exact duplicates instead of near duplicates. It's not recommended in general, but sometimes necessary. You can restart at the "dedup" stage with the -j flag. Indeed, this is not a bug and is better suited to the forum.
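Concretely, a restart from the dedup stage might look something like the sketch below. This is not taken from the thread: the genome ID, reference paths, and chrom.sizes file are placeholders for your own setup; only -S (stage) and -j (exact-duplicate dedupping) are the flags discussed above.

```shell
# Resume the pipeline at the dedup stage, removing exact duplicates only.
# -g/-z/-p values are placeholders: substitute your own reference files.
bash ../juicer/scripts/juicer.sh -S dedup -j \
    -g myGenome -z references/myGenome.fasta -p myGenome.chrom.sizes
```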


--
Neva Cherniavsky Durand, Ph.D.
Assistant Professor, Aiden Lab
www.aidenlab.org




=== https://github.com/aidenlab/juicer/issues/140 ===

[you] Oct 29, 2019, 11:28 AM GMT+1


Dear devs,
I'm trying to isolate the source of the issue, but it still escapes me. Maybe it is a bug, which is why I'm writing here (sorry if that's not the case).

I'm using Juicer/3D-DNA on SLURM. In my run, Juicer generates split files (split0000 to split0186) in the ../aligned dir.
The file sizes are mostly between 303M and 312M, with the exceptions of split0186, which is 34M, and split0046, which is 1.4G.
The file split0046 is tremendously slowing the Juicer pipeline, since this command has been running for days (>5):
awk -f ../juicer/scripts/dups.awk -v name=../aligned/a1571832472_msplit0046_ ../aligned/split0046
The split0046 file has 4,485,383 reads, while split0186 has 107,915. The other split files have roughly 1,000,000 reads each.

With the exception of the parameters -S early and -t 32, everything is at its default. I only modified some lines of the juicer.sh script to adapt it to our infrastructure (e.g. SBATCH arguments like --qos).

Could you point me to how to modify the code to avoid generating such a large file?
Thanks!

D. N. De Panis

Oct 29, 2019, 12:43:31 PM
to 3D Genomics
Neva, the running time seems to improve a lot with more RAM.
I added #SBATCH --mem=8G in the corresponding places of the split_rmdups.awk script.
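For anyone hitting the same wall, the change is just a SLURM directive in the header of the sbatch job that runs the dedup awk (the exact placement depends on how your juicer.sh submission blocks are structured; the value is whatever your node can spare):

```shell
#SBATCH --mem=8G    # raise the per-job memory cap for the dedup step
```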

Neva Durand

Oct 29, 2019, 1:43:05 PM
to D. N. De Panis, 3D Genomics
The long explanation is the following:

When there are a lot of duplicates and near duplicates, the code must store all candidates together and process them at the same time, which takes much more memory and time than the usual case. In an experiment with a small fraction of duplicates (we like this number to be under 10%), this bad case does not arise. Moreover, these reads are often mapped to repetitive regions; an upcoming update to Juicer will shunt off MAPQ 0 reads before dedupping, which will likely eliminate the problem entirely.

In the meantime, when you run into the problem, you can add memory and time to the jobs on the cluster and just wait; or you can use the "-j" flag, which eliminates only exact matches instead of near duplicates. No additional memory is needed in that case, since you don't need to check the second end for wobble. We only advise doing this when you know (because of the unusual time and memory requirements) that you are likely to have lots of duplicates. With the -j flag, you will be including reads in your merged_nodups file that are artifacts.
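To make the contrast concrete, here is a toy sketch (not dups.awk; the four-field pair layout and the 4 bp tolerance are assumptions for illustration): exact dedupping streams through the file with one hash lookup per line, while wobble dedupping must keep every candidate record and compare second-end positions pairwise, which is where the extra memory and time go.

```shell
# Toy paired records: first-end chrom/pos, second-end chrom/pos.
cat > demo_pairs.txt <<'EOF'
chr1 1000 chr2 5000
chr1 1000 chr2 5002
chr1 1000 chr2 9000
chr1 1000 chr2 9000
EOF

# Exact duplicates only (conceptually what -j does): one streaming
# hash lookup per line, constant extra memory per unique record.
awk '!seen[$0]++' demo_pairs.txt > exact_kept.txt

# Near duplicates: same first end, second end within 4 bp. Every kept
# candidate must stay in memory and be compared against each new line.
awk '
  {
      dup = 0
      for (i = 1; i <= n; i++)
          if ($1 == c1[i] && $2 == p1[i] && $3 == c2[i] &&
              $4 - p2[i] <= 4 && p2[i] - $4 <= 4)
              dup = 1
      if (!dup) {
          n++; c1[n] = $1; p1[n] = $2; c2[n] = $3; p2[n] = $4
          print
      }
  }
' demo_pairs.txt > near_kept.txt
```

Here the exact pass keeps three records (only the verbatim repeat is dropped), while the wobble pass also drops the read whose second end is 2 bp away, at the cost of holding and scanning the candidate arrays.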

Moshe also has a script to look for blacklist regions if you're aligning to mouse or human, which helps with this problem.
