merge taking up too much memory


Jordi Camps

Apr 28, 2021, 7:12:39 AM
to sambamba-discussion

Hello,

In a recent run of sambamba merge (0.7.0) I observed surprisingly high memory usage for an operation that should need almost none.
According to SLURM accounting, the job took 2572 s and used 6.33 GiB of memory.

That could make sense if it were a sort, but this was just a merge of already-sorted inputs, so the theoretical minimum is one buffered record per input file. There were only 4 input files, and the output file is 182 GiB (compression level 2).
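For what it's worth, the reasoning above is just the textbook k-way merge: keep one buffered record per input on a min-heap and repeatedly emit the smallest. A minimal Python sketch (not sambamba's code; it assumes records compare by their sort key):

```python
import heapq

def merge_sorted_streams(streams):
    """K-way merge of already-sorted record iterators.

    At any moment only one record per input sits on the heap,
    so peak memory is proportional to the number of inputs,
    independent of how large the files are.
    """
    heap = []
    for idx, stream in enumerate(map(iter, streams)):
        first = next(stream, None)
        if first is not None:
            # idx breaks ties, so two streams are never compared directly
            heapq.heappush(heap, (first, idx, stream))
    while heap:
        record, idx, stream = heapq.heappop(heap)
        yield record
        nxt = next(stream, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx, stream))

# Toy usage: records are (reference_id, position) sort keys.
a = [(0, 10), (0, 50), (1, 5)]
b = [(0, 20), (1, 1)]
print(list(merge_sorted_streams([a, b])))
# [(0, 10), (0, 20), (0, 50), (1, 1), (1, 5)]
```

Python's standard library ships the same idea as heapq.merge.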

Buffers can explain some memory consumption, but you don't need GiB-level buffers for optimal IO performance.

How much memory should I allocate when running a merge? A function of the input size? A fixed quantity?

This looks like a memory allocation bug to me, probably some kind of leak, since I can't see a reason for such high usage. Or is there a reason why it needs this much memory?

Pjotr Prins

Apr 28, 2021, 10:49:52 AM
to Jordi Camps, sambamba-discussion
Hi Jordi,

Merge does some administration, especially with linked reads.

Furthermore, if the read depth is high there may be problems. Some
datasets have very high depth on recent sequencers. Take a look at
your data and the source code to get some idea. I don't think it is a
bug or a leak. Also, 6 GB of memory is not extreme for a dataset this
size.
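To make the "administration" point concrete, here is a toy illustration (purely hypothetical, not sambamba's actual bookkeeping) of how tracking read names during a merge makes memory grow with the total number of records rather than with the number of inputs:

```python
def track_read_names(records):
    """Toy pass over merged records that remembers every read name seen.

    The 'seen' set grows with the total number of records, so memory
    scales with dataset size, unlike a plain k-way merge, which only
    buffers one record per input.
    """
    seen = set()
    for name, payload in records:   # assume each record is (read_name, payload)
        if name in seen:
            pass                    # a pairing/consistency check on repeated names could go here
        seen.add(name)
        yield name, payload
```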

If you compile sambamba with debug mode you can actually track what
the garbage collector is doing. ldc has support for that.

Pj.

Jordi Camps

Aug 17, 2021, 7:08:21 AM
to sambamba-discussion
Sorry for the very late answer. I've been looking for time to check the sources, but I could only give them a quick glance.
I understand that you are doing some checks in the process, some of them storing the read IDs, so the memory usage goes up. While I understand that some checking can be useful, especially cheap checks that lead to early error detection, my opinion is that there should be a specialized tool for that. 6 GiB is not an extreme amount of memory for validating that dataset, but it is extreme for a merge without validation.
The main point is: if all tools perform the same checks, all tools are devoting time and space to the same task over and over, increasing time and resource usage along the way. If I take the output from an aligner and use sambamba view to convert it to BAM, then pipe it to sambamba sort and finally pipe it to sambamba markdup (not sure if all this piping can be done, just an example), you are repeating the same process three times and asking for 3 × 6 GiB of extra memory. Here I'm assuming you do the same checks everywhere for consistency; why would you do some checks in the merge but not elsewhere?
My opinion is that it is much better to have a validation tool that can be run before/after all the manipulations, so that all the intermediate manipulations can run at top speed with low resource usage. That means you stop babysitting the users and give them the responsibility for validating their own data.

Just an opinion, of course you are free to do whatever seems more convenient to you :-)

Let me know if I can be of any help, but I have never programmed in D, so I probably won't be of much use there :-/

On Wednesday, April 28, 2021 at 16:49:52 UTC+2, pjotr...@gmail.com wrote: