BAMsortRAM requirement and default value?


Sol Katzman

Jun 23, 2015, 7:52:17 PM
to rna-...@googlegroups.com
Dear Alex,

STAR version 2.4.0h

Seeing poor performance on our cluster nodes, the sysadmin told me that
my numerous star jobs were "swapping like crazy".

That seemed unlikely to be due to alignment, since I had allocated 32G
(and 4 cpus) to each job. I suspect it was occurring at the "sorting BAM" step.
Below are time stamps from a few of the jobs.

I did not specify "--limitBAMsortRAM" and the documentation says:
"it will be set to the genome index size".

That seems a little odd because I would guess that the need for RAM
is dependent on the number of mappings; the index size seems irrelevant.

In this case, I had over 80M mapped paired-end fragments for many of the jobs.
And they used a mouse target genome (built with a gene model), having these file sizes
in the genome directory:

-rw-rw-r-- 1 sol 2.6G Dec 16  2014 Genome
-rw-rw-r-- 1 sol  21G Dec 16  2014 SA
-rw-rw-r-- 1 sol 1.5G Dec 16  2014 SAindex

So which of the above is the default?  1.5G, 2.6G, or 21G?

I see that release of version 2.4.0k says:
"Implemented new BAM sorting algorithm that reduces significantly the required RAM"

And I see that after version 2.4.1 I don't need to include the gene model in the built index.

So I will upgrade and test those things out, but I would appreciate it if you
could clarify the default for the parameter "--limitBAMsortRAM"

Thanks so much for all your efforts.
/Sol Katzman

13.log:Jun 22 10:17:36 ..... Started STAR run
13.log:Jun 22 10:38:18 ..... Started mapping
13.log:Jun 22 12:24:11 ..... Started sorting BAM
13.log:Jun 22 12:58:35 ..... Finished successfully

23.log:Jun 22 10:17:38 ..... Started STAR run
23.log:Jun 22 10:38:16 ..... Started mapping
23.log:Jun 22 12:40:46 ..... Started sorting BAM
23.log:Jun 22 13:06:12 ..... Finished successfully

3.log:Jun 22 10:17:38 ..... Started STAR run
3.log:Jun 22 10:38:23 ..... Started mapping
3.log:Jun 22 12:40:37 ..... Started sorting BAM
3.log:Jun 22 13:06:58 ..... Finished successfully

33.log:Jun 22 10:17:36 ..... Started STAR run
33.log:Jun 22 10:38:18 ..... Started mapping
33.log:Jun 22 11:53:59 ..... Started sorting BAM
33.log:Jun 22 12:49:04 ..... Finished successfully






Alexander Dobin

Jun 25, 2015, 12:36:05 PM
to rna-...@googlegroups.com, s...@soe.ucsc.edu
Hi Sol,

the default value of --limitBAMsortRAM is the sum of the Genome, SA, and SAindex file sizes. The idea is that since we have already used that much RAM for mapping, we can use the same amount for sorting - the RAM used for the genome indices is de-allocated before sorting. (This is not true for the shared-memory --genomeLoad options, in which case you have to specify the sorting RAM explicitly.)
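For reference, that default can be estimated directly from the genome directory. A minimal sketch in Python (illustrative only, not STAR's actual code; the file names are the standard index files listed above):

```python
import os

def default_bam_sort_ram(genome_dir):
    """Approximate STAR's default --limitBAMsortRAM: the combined
    size in bytes of the Genome, SA, and SAindex index files."""
    return sum(
        os.path.getsize(os.path.join(genome_dir, name))
        for name in ("Genome", "SA", "SAindex")
        if os.path.exists(os.path.join(genome_dir, name))
    )
```

For the mouse index above this gives roughly 2.6G + 21G + 1.5G, i.e. about 25G.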

The new BAM sorting algorithm requires much less maximum RAM and will hopefully solve your problems. I would set --limitBAMsortRAM to a few GB below your RAM allocation, say 30GB.
Note that sorting still does quite a lot of disk reading/writing. If you have a large enough local disk on each of your cluster nodes, you can point --outTmpDir at the local disk to reduce network traffic.

Cheers
Alex

Sol Katzman

Jun 25, 2015, 4:01:13 PM
to rna-...@googlegroups.com, s...@soe.ucsc.edu
Dear Alex,

Thanks for the explanation. That clarifies things.

It also explains why, when I was using the (small) yeast genome as a target, I got
a very helpful error message that BAMsortRAM was too small. Thanks for that!

I did some experiments (before I saw your response to this thread), on 10 samples,
run 4 different ways:

TEST1: 2.4.0h -- BAMsortRAM default
TEST2: 2.4.0h -- BAMsortRAM 12G
TEST3: 2.4.1d -- BAMsortRAM default
TEST4: 2.4.1d -- BAMsortRAM 12G


Unfortunately, the results are quite mixed. Sometimes the mapping steps were short (5-10 mins),
other times very long (about 1 hr). Likewise for the sorting step. No real consistency across tests.

So I will chalk this up to something about our cluster nodes getting bogged down that is independent of STAR.

Incidentally, I run everything from the local scratch disk of the cluster node, and each cluster node
has a copy of the genome index on the local scratch disk. So network traffic is probably not the issue.

And I just checked and found that /scratch has about 400G free per node. At most I can run 8 jobs
with 4 cpus each on these 32-cpu nodes. So maybe /scratch gets a little too busy.

Well, not your problem.

Cheers,
/Sol.

Alexander Dobin

Jun 30, 2015, 12:38:17 PM
to rna-...@googlegroups.com, s...@soe.ucsc.edu
Hi Sol,

why are you limiting the number of threads to 4? Mapping speed scales only slightly sublinearly with the number of threads, at least until disk writing becomes the bottleneck.
For consistent benchmarking you would need to reserve the entire node for the STAR run (e.g. by requesting 32 cpus).
Note that the number of threads used in sorting is defined by --outBAMsortingThreadN; by default it's min(6, runThreadN).
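That default rule is simple enough to write down as a one-line sketch:

```python
def default_sorting_threads(run_thread_n):
    # STAR's default --outBAMsortingThreadN, per the rule above:
    # the smaller of 6 and the mapping thread count.
    return min(6, run_thread_n)
```

So with 4-cpu jobs, sorting uses all 4 threads; only at --runThreadN above 6 does the cap kick in.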

Cheers
Alex

Sol Katzman

Jun 30, 2015, 2:35:55 PM
to rna-...@googlegroups.com, s...@soe.ucsc.edu
Dear Alex,

The cluster where I run is controlled by the "parasol" queueing system, designed by Jim Kent over
a decade ago. Most of the UCSC Genome browser tracks are generated there. The intent is that any
task should be divided into many jobs that should each complete in minutes. Cpu-level threading was
apparently not a big design consideration. Parasol is very good at efficiently queuing and dispatching
tens, or even hundreds of thousands of jobs.

But I guess parasol prioritizes throughput over latency. When a cpu becomes available because a job has
completed, it will apparently dispatch a single cpu job, rather than wait for 3 more cpus to become available
in order to dispatch a 4-cpu job. So my 4-cpu jobs can get stuck waiting for a long time when the
cluster is very busy. I could wait for hours if I tried to ask for more than 4 cpus!

Another issue, perhaps more relevant to my recent results, is that parasol seems to make no effort
to dispatch jobs to different nodes, even if they have the resources available. So when I submit 40
4-cpu jobs, they seem to get dispatched to only about 5 of the 32 nodes. And for some reason that
I still do not understand, they seem to compete for some resource on each node. I am guessing that the
dispatch algorithm could be improved from that standpoint, and I might suggest it to Jim, or the person
who currently maintains parasol.

For the record, below are the results that I got running 10 jobs in each of the 4 test configurations that
I mentioned previously.

The job 6 ("6.log") results show an incredible disparity.

In TEST1 mapping was 1:06, sorting 0:19.
In TEST4 mapping was 0:08, sorting 0:04.

Other jobs show mapping in 0:08 in one TEST, 0:45 in another TEST.
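For anyone repeating this, the per-step durations can be computed from the log timestamps with a short script. A sketch assuming the "Mon DD HH:MM:SS ..... Step" line format shown in these logs (the year is not in the logs, so it is passed in):

```python
from datetime import datetime

def step_durations(log_lines, year=2015):
    """Parse STAR log lines like 'Jun 24 12:00:55 ..... Started mapping'
    and return (step, duration) pairs, where each duration is the time
    from that step's timestamp to the next one."""
    events = []
    for line in log_lines:
        stamp, _, step = line.partition(" ..... ")
        events.append((datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"),
                       step.strip()))
    return [(step, later - start)
            for (start, step), (later, _) in zip(events, events[1:])]
```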

As you say, I would need to reserve the whole node for a valid benchmark.

And of course, I need to say that this is all orders of magnitude better than tophat!

So thanks for STAR!

/Sol.
-----------------------------------------------------

TEST1: 2.4.0h -- BAMsortRAM default
TEST2: 2.4.0h -- BAMsortRAM 12G
TEST3: 2.4.1d -- BAMsortRAM default
TEST4: 2.4.1d -- BAMsortRAM 12G

Note: even/odd pairs of jobs (1/2 3/4 5/6...) each contain half the
reads from a single sample so should have similar characteristics.

Jobs 1 through 8 had 35M-40M input reads each. Jobs 9-10 had 20M reads each.

TEST1 1.log results    TEST2 1.log results    TEST3 1.log results    TEST4 1.log results
Jun 24 12:00:42     Jun 24 12:01:01     Jun 24 12:01:01     Jun 24 12:01:10 ..... Started STAR run
Jun 24 12:00:59     Jun 24 12:05:41     Jun 24 12:05:41     Jun 24 12:15:11 ..... Started mapping
Jun 24 12:47:43     Jun 24 12:13:25     Jun 24 12:13:20     Jun 24 13:09:28 ..... Started sorting BAM
Jun 24 13:25:35     Jun 24 12:26:11     Jun 24 12:26:04     Jun 24 13:26:28 ..... Finished successfully

TEST1 2.log results    TEST2 2.log results    TEST3 2.log results    TEST4 2.log results
Jun 24 12:00:44     Jun 24 12:00:57     Jun 24 12:01:01     Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:01:00     Jun 24 12:07:44     Jun 24 12:07:44     Jun 24 12:05:40 ..... Started mapping
Jun 24 12:31:51     Jun 24 12:18:56     Jun 24 12:17:14     Jun 24 12:43:53 ..... Started sorting BAM
Jun 24 13:00:27     Jun 24 12:23:31     Jun 24 12:21:42     Jun 24 13:01:18 ..... Finished successfully

TEST1 3.log results    TEST2 3.log results    TEST3 3.log results    TEST4 3.log results
Jun 24 12:00:41     Jun 24 12:01:00     Jun 24 12:01:01     Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:00:56     Jun 24 12:09:12     Jun 24 12:09:12     Jun 24 12:15:58 ..... Started mapping
Jun 24 12:08:51     Jun 24 12:16:07     Jun 24 12:16:02     Jun 24 13:09:23 ..... Started sorting BAM
Jun 24 12:14:20     Jun 24 12:22:27     Jun 24 12:22:26     Jun 24 13:26:11 ..... Finished successfully

TEST1 4.log results    TEST2 4.log results    TEST3 4.log results    TEST4 4.log results
Jun 24 12:00:39     Jun 24 12:00:59     Jun 24 12:00:59     Jun 24 12:01:09 ..... Started STAR run
Jun 24 12:00:56     Jun 24 12:08:51     Jun 24 12:08:51     Jun 24 12:01:29 ..... Started mapping
Jun 24 12:08:21     Jun 24 12:15:54     Jun 24 12:15:54     Jun 24 12:35:54 ..... Started sorting BAM
Jun 24 12:12:23     Jun 24 12:24:30     Jun 24 12:25:22     Jun 24 13:00:31 ..... Finished successfully

TEST1 5.log results    TEST2 5.log results    TEST3 5.log results    TEST4 5.log results
Jun 24 12:00:38     Jun 24 12:01:00     Jun 24 12:00:59     Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:00:55     Jun 24 12:08:22     Jun 24 12:08:22     Jun 24 12:05:40 ..... Started mapping
Jun 24 12:09:25     Jun 24 12:15:36     Jun 24 12:15:26     Jun 24 12:44:32 ..... Started sorting BAM
Jun 24 12:16:25     Jun 24 12:22:36     Jun 24 12:22:24     Jun 24 13:01:26 ..... Finished successfully

TEST1 6.log results    TEST2 6.log results    TEST3 6.log results    TEST4 6.log results
Jun 24 12:00:39     Jun 24 12:00:56     Jun 24 12:00:56     Jun 24 12:10:40 ..... Started STAR run
Jun 24 12:00:55     Jun 24 12:01:13     Jun 24 12:01:16     Jun 24 12:10:59 ..... Started mapping
Jun 24 13:06:01     Jun 24 12:56:39     Jun 24 12:50:09     Jun 24 12:18:19 ..... Started sorting BAM
Jun 24 13:25:50     Jun 24 13:25:09     Jun 24 13:23:10     Jun 24 12:22:39 ..... Finished successfully

TEST1 7.log results    TEST2 7.log results    TEST3 7.log results    TEST4 7.log results
Jun 24 12:00:40     Jun 24 12:00:55     Jun 24 12:00:56     Jun 24 12:13:33 ..... Started STAR run
Jun 24 12:00:53     Jun 24 12:01:15     Jun 24 12:01:14     Jun 24 12:13:50 ..... Started mapping
Jun 24 12:20:52     Jun 24 12:22:00     Jun 24 12:19:49     Jun 24 12:20:50 ..... Started sorting BAM
Jun 24 12:58:13     Jun 24 12:58:13     Jun 24 12:57:53     Jun 24 12:25:09 ..... Finished successfully

TEST1 8.log results    TEST2 8.log results    TEST3 8.log results    TEST4 8.log results
Jun 24 12:00:45     Jun 24 12:01:01     Jun 24 12:01:01     Jun 24 12:16:06 ..... Started STAR run
Jun 24 12:01:04     Jun 24 12:08:51     Jun 24 12:05:41     Jun 24 12:16:27 ..... Started mapping
Jun 24 12:09:30     Jun 24 12:16:15     Jun 24 12:13:00     Jun 24 12:23:58 ..... Started sorting BAM
Jun 24 12:14:50     Jun 24 12:25:22     Jun 24 12:26:10     Jun 24 12:27:34 ..... Finished successfully

TEST1 9.log results    TEST2 9.log results    TEST3 9.log results    TEST4 9.log results
Jun 24 12:00:32     Jun 24 12:00:54     Jun 24 12:00:54     Jun 24 12:16:02 ..... Started STAR run
Jun 24 12:00:47     Jun 24 12:08:22     Jun 24 12:02:35     Jun 24 12:16:23 ..... Started mapping
Jun 24 12:05:00     Jun 24 12:11:48     Jun 24 12:06:49     Jun 24 12:19:58 ..... Started sorting BAM
Jun 24 12:09:50     Jun 24 12:14:20     Jun 24 12:09:20     Jun 24 12:22:11 ..... Finished successfully

TEST1 10.log results    TEST2 10.log results    TEST3 10.log results    TEST4 10.log results
Jun 24 12:00:36     Jun 24 12:00:51     Jun 24 12:00:51     Jun 24 12:17:25 ..... Started STAR run
Jun 24 12:00:50     Jun 24 12:01:09     Jun 24 12:09:12     Jun 24 12:18:27 ..... Started mapping
Jun 24 12:19:17     Jun 24 12:17:09     Jun 24 12:12:39     Jun 24 12:21:59 ..... Started sorting BAM
Jun 24 13:00:17     Jun 24 12:52:43     Jun 24 12:14:54     Jun 24 12:27:03 ..... Finished successfully


Sol Katzman

Jun 30, 2015, 3:55:35 PM
to rna-...@googlegroups.com
For completeness, these are the nodes where the 40 jobs ran.

TEST1_node      TEST2_node      TEST3_node      TEST4_node     job            
ku-1-05.local   ku-1-10.local   ku-1-10.local   ku-1-05.local  para.exec.csh-1
ku-1-04.local   ku-1-09.local   ku-1-09.local   ku-1-04.local  para.exec.csh-2
ku-1-03.local   ku-1-08.local   ku-1-08.local   ku-1-05.local  para.exec.csh-3
ku-1-02.local   ku-1-07.local   ku-1-07.local   ku-1-04.local  para.exec.csh-4
ku-1-01.local   ku-1-06.local   ku-1-06.local   ku-1-04.local  para.exec.csh-5
ku-1-05.local   ku-1-05.local   ku-1-05.local   ku-1-01.local  para.exec.csh-6
ku-1-04.local   ku-1-04.local   ku-1-04.local   ku-1-02.local  para.exec.csh-7
ku-1-03.local   ku-1-07.local   ku-1-10.local   ku-1-03.local  para.exec.csh-8
ku-1-01.local   ku-1-06.local   ku-1-09.local   ku-1-03.local  para.exec.csh-9
ku-1-05.local   ku-1-05.local   ku-1-08.local   ku-1-10.local  para.exec.csh-10

Alexander Dobin

Jul 1, 2015, 4:56:46 PM
to rna-...@googlegroups.com, s...@soe.ucsc.edu
Hi Sol,

in the job 6 case, TEST1-3 were all run on the same node and took 1.5 hours, while TEST4 was run on another node and took 12 min.
That is quite severe competition for resources... Most likely it's disk writing, though it could be a RAM issue as well - how much RAM do you have per node?
If it's common to have multiple jobs per node, it might be helpful to try using shared memory (e.g. --genomeLoad LoadAndRemove for each of the jobs).

Cheers
Alex

Sol Katzman

Jul 2, 2015, 2:06:32 PM
to rna-...@googlegroups.com, s...@soe.ucsc.edu
Dear Alex,

Thanks for being willing to help track down this issue. Maybe we can find a better fix than simply "throwing more resources at it".

Below, for the 2 cases run with STAR version 2.4.1d (TEST3, TEST4), I extracted the "Loading genome" time stamps
(which are not available in the 2.4.0h logs).

You can see that loading the genome often completed in under a minute, but in other cases took up to 15 minutes.

[You can also see in all cases that there are only a few seconds between "Started STAR run" and "Loading genome".
So for the 2.4.0h cases, the loading genome time is presumably nearly all the time between "Started STAR run" and "Started mapping".
Again, often under a minute, but sometimes 5-10 minutes.]

So I think the strategy of having the genome index on the local /scratch disk, and having 32G of RAM allocated to the
job usually works.

As far as RAM, I verified that each node does have 256G available. Since there are 32 cpus per node, the parasol default
is to allocate 1 cpu and 8G to a job. (There is apparently 10G of swap space per node.) And allocating 32G for a 4-cpu
job should not exhaust the RAM on the node, since at most 8 such jobs should be running on any one node.
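As a sanity check on that arithmetic (node and job sizes are the ones quoted above):

```python
# Per-node resources and per-job allocation, as described above.
node_cpus, node_ram_gb = 32, 256
job_cpus, job_ram_gb = 4, 32

# Jobs per node are limited by whichever resource runs out first;
# here both cpus and RAM cap out at the same count.
max_jobs = min(node_cpus // job_cpus, node_ram_gb // job_ram_gb)
print(max_jobs)  # 8
```

So 8 concurrent 4-cpu/32G jobs exactly fill a node on both axes, leaving nothing for the OS page cache, which may itself contribute to the contention.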

So I think this is probably a disk access issue. But what data structures does STAR write to disk? What kind of buffering
is used? Is there anything that would cause the genome index to get swapped out during the mapping step? Similarly,
is there anything "interesting" that could be controlled in the sorting step?

Finally, I confess that I don't really understand how the "--genomeLoad" parameter interacts with the OS, and how
it would interact with the parasol dispatching system. I don't think parasol has any concept of sharing the
memory among different jobs, but I am not much of a linux expert.

If there is any more monitoring that I could do of the jobs in progress, I am happy to do some more testing. The cluster
is likely to be pretty lightly used over the upcoming holiday weekend.

/Sol.

--------------------------------------------------------------

Snapshot of one cluster node:

(sol@ku-1-05)/cluster/home/sol$ free -m
             total       used       free     shared    buffers     cached
Mem:        258457      43961     214496          0        489      31857
-/+ buffers/cache:      11614     246843
Swap:         9999         40       9959


(sol@ku-1-05)/cluster/home/sol$ cat /proc/meminfo
MemTotal:       264660816 kB
MemFree:        219642976 kB
Buffers:          500796 kB
Cached:         32622272 kB
SwapCached:        15484 kB
Active:         24209988 kB
Inactive:        9007572 kB
Active(anon):      50000 kB
Inactive(anon):    44528 kB
Active(file):   24159988 kB
Inactive(file):  8963044 kB
Unevictable:     8388608 kB
Mlocked:          245312 kB
SwapTotal:      10239992 kB
SwapFree:       10198016 kB

---------------------------------------------------------


TEST3 1.log results    TEST4 1.log results
Jun 24 12:01:01     Jun 24 12:01:10 ..... Started STAR run
Jun 24 12:01:01     Jun 24 12:01:11 ..... Loading genome

Jun 24 12:05:41     Jun 24 12:15:11 ..... Started mapping
Jun 24 12:13:20     Jun 24 13:09:28 ..... Started sorting BAM
Jun 24 12:26:04     Jun 24 13:26:28 ..... Finished successfully

TEST3 2.log results    TEST4 2.log results
Jun 24 12:01:01     Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:01:01     Jun 24 12:01:12 ..... Loading genome

Jun 24 12:07:44     Jun 24 12:05:40 ..... Started mapping
Jun 24 12:17:14     Jun 24 12:43:53 ..... Started sorting BAM
Jun 24 12:21:42     Jun 24 13:01:18 ..... Finished successfully

TEST3 3.log results    TEST4 3.log results
Jun 24 12:01:01     Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:01:01     Jun 24 12:01:12 ..... Loading genome

Jun 24 12:09:12     Jun 24 12:15:58 ..... Started mapping
Jun 24 12:16:02     Jun 24 13:09:23 ..... Started sorting BAM
Jun 24 12:22:26     Jun 24 13:26:11 ..... Finished successfully

TEST3 4.log results    TEST4 4.log results
Jun 24 12:00:59     Jun 24 12:01:09 ..... Started STAR run
Jun 24 12:00:59     Jun 24 12:01:10 ..... Loading genome

Jun 24 12:08:51     Jun 24 12:01:29 ..... Started mapping
Jun 24 12:15:54     Jun 24 12:35:54 ..... Started sorting BAM
Jun 24 12:25:22     Jun 24 13:00:31 ..... Finished successfully

TEST3 5.log results    TEST4 5.log results
Jun 24 12:00:59     Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:00:59     Jun 24 12:01:12 ..... Loading genome

Jun 24 12:08:22     Jun 24 12:05:40 ..... Started mapping
Jun 24 12:15:26     Jun 24 12:44:32 ..... Started sorting BAM
Jun 24 12:22:24     Jun 24 13:01:26 ..... Finished successfully

TEST3 6.log results    TEST4 6.log results
Jun 24 12:00:56     Jun 24 12:10:40 ..... Started STAR run
Jun 24 12:00:57     Jun 24 12:10:44 ..... Loading genome

Jun 24 12:01:16     Jun 24 12:10:59 ..... Started mapping
Jun 24 12:50:09     Jun 24 12:18:19 ..... Started sorting BAM
Jun 24 13:23:10     Jun 24 12:22:39 ..... Finished successfully

TEST3 7.log results    TEST4 7.log results
Jun 24 12:00:56     Jun 24 12:13:33 ..... Started STAR run
Jun 24 12:00:56     Jun 24 12:13:33 ..... Loading genome

Jun 24 12:01:14     Jun 24 12:13:50 ..... Started mapping
Jun 24 12:19:49     Jun 24 12:20:50 ..... Started sorting BAM
Jun 24 12:57:53     Jun 24 12:25:09 ..... Finished successfully

TEST3 8.log results    TEST4 8.log results
Jun 24 12:01:01     Jun 24 12:16:06 ..... Started STAR run
Jun 24 12:01:01     Jun 24 12:16:06 ..... Loading genome

Jun 24 12:05:41     Jun 24 12:16:27 ..... Started mapping
Jun 24 12:13:00     Jun 24 12:23:58 ..... Started sorting BAM
Jun 24 12:26:10     Jun 24 12:27:34 ..... Finished successfully

TEST3 9.log results    TEST4 9.log results
Jun 24 12:00:54     Jun 24 12:16:02 ..... Started STAR run
Jun 24 12:00:54     Jun 24 12:16:02 ..... Loading genome

Jun 24 12:02:35     Jun 24 12:16:23 ..... Started mapping
Jun 24 12:06:49     Jun 24 12:19:58 ..... Started sorting BAM
Jun 24 12:09:20     Jun 24 12:22:11 ..... Finished successfully

TEST3 10.log results    TEST4 10.log results
Jun 24 12:00:51     Jun 24 12:17:25 ..... Started STAR run
Jun 24 12:00:51     Jun 24 12:17:37 ..... Loading genome

Jun 24 12:09:12     Jun 24 12:18:27 ..... Started mapping
Jun 24 12:12:39     Jun 24 12:21:59 ..... Started sorting BAM
Jun 24 12:14:54     Jun 24 12:27:03 ..... Finished successfully

Alexander Dobin

Jul 6, 2015, 6:59:42 PM
to rna-...@googlegroups.com, s...@soe.ucsc.edu
Hi Sol,

loading the genome from the local disk is a good idea; this should take no more than 1-2 min.

While mapping, STAR writes the alignments in a "raw" BAM format in multiple chunks. Then it loads the chunks from disk, sorts them, and writes the sorted chunks in BAM format. Finally, all chunks are concatenated into one BAM file. So all alignments are written three times, which makes the process I/O-intensive. The chunks are erased as they are loaded into RAM, so the maximum disk usage is roughly the size of the uncompressed BAM file.
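The scheme described above is a classic external merge sort. A minimal sketch of the same idea in Python (plain text records instead of BAM, purely illustrative, not STAR's implementation):

```python
import heapq
import os
import tempfile

def external_sort(records, chunk_size, out_path):
    """Write records to disk in sorted chunks, then merge the chunks
    into one output file. Each record is written more than once,
    which is what makes this approach I/O-intensive."""
    chunk_files = []
    chunk = []

    def flush():
        if not chunk:
            return
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.writelines(f"{r}\n" for r in sorted(chunk))
        f.flush()
        f.seek(0)
        chunk_files.append(f)
        chunk.clear()

    for r in records:
        chunk.append(r)
        if len(chunk) >= chunk_size:
            flush()
    flush()

    with open(out_path, "w") as out:
        # heapq.merge streams the sorted chunks line by line,
        # so only one record per chunk is held in RAM at a time.
        for line in heapq.merge(*chunk_files):
            out.write(line)

    for f in chunk_files:
        f.close()
        os.unlink(f.name)
```

The per-chunk temporary files in this sketch play the role of STAR's raw BAM chunks; deleting them after the merge mirrors STAR erasing chunks as they are consumed.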

If disk writing bandwidth really is the bottleneck, one thing you can try is to use part of your RAM as a virtual disk via /dev/shm and point STAR's temporary directory (--outTmpDir) there. Then only the final write goes to the disk. Of course, this will only work if the output file is not too big.
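A small helper for that tip, checking free space before committing to a RAM-backed directory (an illustrative sketch; the candidate paths and fallback behaviour are assumptions, not anything STAR does itself):

```python
import os
import shutil

def pick_tmp_dir(needed_bytes, candidates=("/dev/shm", "/tmp")):
    """Return the first candidate directory with enough free space,
    preferring the RAM-backed /dev/shm to cut disk writes.
    Returns None if no candidate has room."""
    for path in candidates:
        if os.path.isdir(path) and shutil.disk_usage(path).free >= needed_bytes:
            return path
    return None
```

The chosen path (plus a unique, not-yet-existing subdirectory name) would then be passed to --outTmpDir; the needed_bytes estimate is roughly the uncompressed BAM size mentioned above.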

The genome index should not get swapped out until all the RAM is used up, which is unlikely on a 256GB machine unless more than 8 jobs run simultaneously.

With so much RAM I do not think there is significant benefit from using shared memory, except for reduced genome loading time.

Cheers
Alex