Dear Alex,
Thanks for being willing to help track down this issue. Maybe we can find a better fix than simply "throwing more resources at it".
Below, for the 2 cases run with STAR version 2.4.1d (TEST3, TEST4), I extracted the "Loading genome" time stamps
(which were not available in 2.4.0h logs.)
You can see that loading the genome often completed in under one minute, in other cases it took up to 15 minutes.
[You can also see in all cases that there are only a few seconds between "Started STAR run" and "Loading genome".
So for the 2.4.0h cases, the loading genome time is presumably nearly all the time between "Started STAR run" and "Started mapping".
Again, often under a minute, but sometimes 5-10 minutes.]
So I think the strategy of having the genome index on the local /scratch disk, and having 32G of RAM allocated to the
job usually works.
As far as RAM, I verified that each node does have 256G available. Since there are 32-cpus per node, the parasol default
is to allocate 1 cpu and 8G to a job. (There is apparently 10G of swap space per node.) And allocating 32G for a 4-cpu
job should not exhaust the RAM resources on the node, since at most 8 such jobs should be running on any one node.
So I think this is probably a disk access issue. But what data structures does STAR write to disk? What kind of buffering
is used? Is there anything that would cause the genome index to get swapped out during the mapping step? Similarly,
is there anything "interesting" that could be controlled in the sorting step?
Finally, I confess that I don't really understand how the "--genomeLoad" parameter interacts with the OS, and how
it would interact with the parasol dispatching system. I don't think parasol has any concept of sharing the
memory among different jobs, but I am not much of a linux expert.
If there is any more monitoring that I could do of the jobs in progress, I am happy to do some more testing. The cluster
is likely to be pretty lightly used over the upcoming holiday weekend.
/Sol.
--------------------------------------------------------------
Snapshot of one cluster node:
(sol@ku-1-05)/cluster/home/sol$ free -m
total used free shared buffers cached
Mem: 258457 43961 214496 0 489 31857
-/+ buffers/cache: 11614 246843
Swap: 9999 40 9959
(sol@ku-1-05)/cluster/home/sol$ cat /proc/meminfo
MemTotal: 264660816 kB
MemFree: 219642976 kB
Buffers: 500796 kB
Cached: 32622272 kB
SwapCached: 15484 kB
Active: 24209988 kB
Inactive: 9007572 kB
Active(anon): 50000 kB
Inactive(anon): 44528 kB
Active(file): 24159988 kB
Inactive(file): 8963044 kB
Unevictable: 8388608 kB
Mlocked: 245312 kB
SwapTotal: 10239992 kB
SwapFree: 10198016 kB
---------------------------------------------------------
TEST3 1.log results TEST4 1.log results
Jun 24 12:01:01 Jun 24 12:01:10 ..... Started STAR run
Jun 24 12:01:01 Jun 24 12:01:11 ..... Loading genome
Jun 24 12:05:41 Jun 24 12:15:11 ..... Started mapping
Jun 24 12:13:20 Jun 24 13:09:28 ..... Started sorting BAM
Jun 24 12:26:04 Jun 24 13:26:28 ..... Finished successfully
TEST3 2.log results TEST4 2.log results
Jun 24 12:01:01 Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:01:01 Jun 24 12:01:12 ..... Loading genome
Jun 24 12:07:44 Jun 24 12:05:40 ..... Started mapping
Jun 24 12:17:14 Jun 24 12:43:53 ..... Started sorting BAM
Jun 24 12:21:42 Jun 24 13:01:18 ..... Finished successfully
TEST3 3.log results TEST4 3.log results
Jun 24 12:01:01 Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:01:01 Jun 24 12:01:12 ..... Loading genome
Jun 24 12:09:12 Jun 24 12:15:58 ..... Started mapping
Jun 24 12:16:02 Jun 24 13:09:23 ..... Started sorting BAM
Jun 24 12:22:26 Jun 24 13:26:11 ..... Finished successfully
TEST3 4.log results TEST4 4.log results
Jun 24 12:00:59 Jun 24 12:01:09 ..... Started STAR run
Jun 24 12:00:59 Jun 24 12:01:10 ..... Loading genome
Jun 24 12:08:51 Jun 24 12:01:29 ..... Started mapping
Jun 24 12:15:54 Jun 24 12:35:54 ..... Started sorting BAM
Jun 24 12:25:22 Jun 24 13:00:31 ..... Finished successfully
TEST3 5.log results TEST4 5.log results
Jun 24 12:00:59 Jun 24 12:01:12 ..... Started STAR run
Jun 24 12:00:59 Jun 24 12:01:12 ..... Loading genome
Jun 24 12:08:22 Jun 24 12:05:40 ..... Started mapping
Jun 24 12:15:26 Jun 24 12:44:32 ..... Started sorting BAM
Jun 24 12:22:24 Jun 24 13:01:26 ..... Finished successfully
TEST3 6.log results TEST4 6.log results
Jun 24 12:00:56 Jun 24 12:10:40 ..... Started STAR run
Jun 24 12:00:57 Jun 24 12:10:44 ..... Loading genome
Jun 24 12:01:16 Jun 24 12:10:59 ..... Started mapping
Jun 24 12:50:09 Jun 24 12:18:19 ..... Started sorting BAM
Jun 24 13:23:10 Jun 24 12:22:39 ..... Finished successfully
TEST3 7.log results TEST4 7.log results
Jun 24 12:00:56 Jun 24 12:13:33 ..... Started STAR run
Jun 24 12:00:56 Jun 24 12:13:33 ..... Loading genome
Jun 24 12:01:14 Jun 24 12:13:50 ..... Started mapping
Jun 24 12:19:49 Jun 24 12:20:50 ..... Started sorting BAM
Jun 24 12:57:53 Jun 24 12:25:09 ..... Finished successfully
TEST3 8.log results TEST4 8.log results
Jun 24 12:01:01 Jun 24 12:16:06 ..... Started STAR run
Jun 24 12:01:01 Jun 24 12:16:06 ..... Loading genome
Jun 24 12:05:41 Jun 24 12:16:27 ..... Started mapping
Jun 24 12:13:00 Jun 24 12:23:58 ..... Started sorting BAM
Jun 24 12:26:10 Jun 24 12:27:34 ..... Finished successfully
TEST3 9.log results TEST4 9.log results
Jun 24 12:00:54 Jun 24 12:16:02 ..... Started STAR run
Jun 24 12:00:54 Jun 24 12:16:02 ..... Loading genome
Jun 24 12:02:35 Jun 24 12:16:23 ..... Started mapping
Jun 24 12:06:49 Jun 24 12:19:58 ..... Started sorting BAM
Jun 24 12:09:20 Jun 24 12:22:11 ..... Finished successfully
TEST3 10.log results TEST4 10.log results
Jun 24 12:00:51 Jun 24 12:17:25 ..... Started STAR run
Jun 24 12:00:51 Jun 24 12:17:37 ..... Loading genome
Jun 24 12:09:12 Jun 24 12:18:27 ..... Started mapping
Jun 24 12:12:39 Jun 24 12:21:59 ..... Started sorting BAM
Jun 24 12:14:54 Jun 24 12:27:03 ..... Finished successfully