genomeGenerate memory leak


Brendan

Feb 24, 2015, 10:40:46 AM
to rna-...@googlegroups.com
Hi Alex,

As a bit of background, I'm running two-pass alignment manually: I align to a gencode-indexed hg38 (including unassembled contigs), re-index the genome with the splice junctions from the first pass, and align again in a second pass.  I'm aware you've implemented an in-memory two-pass alignment now, which is great, but since it forces the first pass to be aligned de novo it didn't fit what I was trying to do.  In my experience, using annotation in the first pass pre-empts junctions with ambiguous genomic contexts (e.g., ones that could be placed a couple of nucleotides off).  Please let me know if you're planning to enable annotation in the first pass of the in-memory mode, though!
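For reference, the first pass is just an ordinary alignment against the annotated index, roughly along these lines (the index path and read files here are placeholders, not my exact command):

STAR --genomeDir $GENCODE_HG38_INDEX \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz --readFilesCommand zcat \
     --runThreadN $THREADS --outFileNamePrefix $OUTPUT_REPO/$NAME/pass1/

The resulting $OUTPUT_REPO/$NAME/pass1/SJ.out.tab is what feeds the re-indexing step below, and the second pass simply repeats the alignment against the re-built index.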

Recently, I've been debugging a memory leak that's triggered by the genome indexing (genomeGenerate) step.  What's remarkable about the leak is that the memory stays unavailable on the machine beyond the lifetime of the STAR process.  Here's the command I'm running:

STAR --runMode genomeGenerate --genomeDir $OUTPUT_REPO/$NAME/genome_bootstrap \
     --genomeFastaFiles $HG_FASTA --runThreadN $THREADS --limitGenomeGenerateRAM 31000000000 \
     --sjdbOverhang 125 --sjdbFileChrStartEnd $OUTPUT_REPO/$NAME/pass1/SJ.out.tab > Log.std.out

It's pretty standard, and the SJ.out.tab comes directly from the first pass's output.  Here's some example output from my testing script:

log - start: checking memory with `free -m`
             total       used       free     shared    buffers     cached
Mem:        258281     231350      26931      16241        127     228035
-/+ buffers/cache:       3187     255094
Swap:        32767      11346      21421

log - RUNNING GENOMEGENERATE

log - end: checking memory with `free -m`
             total       used       free     shared    buffers     cached
Mem:        258281     225210      33070      16149        127     189458
-/+ buffers/cache:      35624     222657
Swap:        32767      11437      21330

You can see that ~32 GB of memory is gone from the machine.  It's not attributed to any process, and once it runs out, heap allocations (calls to `new`) start failing.  On computing clusters this has been giving me a lot of trouble, because subsequent jobs don't get the memory they need and die horribly.  Also of note, the leak only happens in roughly 20-50% of jobs, i.e., it appears to be random.  I thought it might be related to multithreading, but it still happened after I compiled STAR without OpenMP, so that seems unlikely now.
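For completeness, the testing script is essentially just the genomeGenerate command wrapped in memory snapshots, something like this (a simplified sketch, not the exact script):

echo 'log - start: checking memory with `free -m`'
free -m
echo 'log - RUNNING GENOMEGENERATE'
# ... the genomeGenerate command shown above ...
echo 'log - end: checking memory with `free -m`'
free -m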

I've reproduced the leak with both STAR 2.3.1z12 and 2.4.0j, and on two separate computing clusters.  I'm attaching a couple of log files from a job that leaked, though they aren't discernibly different from those of a job that didn't.  Let me know if you have any insights or workarounds.

Thanks!
Brendan
Attachments: Log.out, Log.std.out

Alexander Dobin

Feb 26, 2015, 3:03:33 PM
to rna-...@googlegroups.com
Hi Brendan,

It's very strange that the memory is not freed after STAR exits; as far as I understand, the OS is supposed to do that.
So my first guess would be that this is some kind of library incompatibility.
Can you try both the static and the dynamic pre-compiled executables (I'm not sure which one you are using)?
Also, could you try compiling STAR from source?
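If it helps, building from source is roughly the following (a sketch; use the release tarball you already have if you prefer):

git clone https://github.com/alexdobin/STAR.git
cd STAR/source
make STAR    # the STAR binary is produced in this directory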

Cheers
Alex

Brendan

Feb 26, 2015, 3:27:57 PM
to rna-...@googlegroups.com
Hey Alex,

I agree - it's not something that's supposed to be possible in Linux.  I tried all of those: the dynamic and static pre-compiled executables, re-compiling myself, and compiling without OpenMP.

Most recently, I figured out that using the local hard drives of the execute nodes worked, while the larger shared filesystems (Lustre and NFS) did not.  My best guess now is:

- STAR does not free all the memory it uses in indexing, but normally the OS handles this
- A randomly reproducible race condition involving a file lock is triggered (~50% of indexing runs)
- The file lock is still intact, so one or more of STAR's processes is still running when the OS tries to free memory
- The OS sees the process still running, and gives up on freeing the memory
- The process ends normally after that

Normally the OS acts as a safety net and frees all memory attributed to a process, but the race condition interferes, and it has something to do with latency on the network filesystem.  This is all pretty speculative, obviously - all I know is that local disks work, and one example each of NFS and Lustre don't.

I believe the workaround of using local storage will be fine for me, but I think this is something to look out for.
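Concretely, the workaround is just pointing the index output at node-local disk and copying the result back afterwards, roughly like this (the scratch path is a placeholder and will depend on the cluster):

SCRATCH=/tmp/$USER/star_genome_$$    # node-local disk, not Lustre/NFS
mkdir -p $SCRATCH
STAR --runMode genomeGenerate --genomeDir $SCRATCH/genome_bootstrap \
     --genomeFastaFiles $HG_FASTA --runThreadN $THREADS --limitGenomeGenerateRAM 31000000000 \
     --sjdbOverhang 125 --sjdbFileChrStartEnd $OUTPUT_REPO/$NAME/pass1/SJ.out.tab
cp -r $SCRATCH/genome_bootstrap $OUTPUT_REPO/$NAME/    # copy the index back to shared storage
rm -rf $SCRATCH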

Thanks!
Brendan

Alexander Dobin

Mar 2, 2015, 4:53:01 PM
to rna-...@googlegroups.com
Hi Brendan,

This sounds like a plausible theory.  Even though these days it's considered a waste of resources, I now feel it's more prudent to explicitly deallocate all the arrays - please check out the patch from GitHub master: https://github.com/alexdobin/STAR/archive/master.zip

Cheers
Alex

Brendan

Mar 4, 2015, 1:12:33 PM
to rna-...@googlegroups.com
Hey Alex,

Thanks!  I'll check it out.

Brendan