I have a 90X WGS tumor/normal pair that I have been trying to run. Our cluster nodes have 48 GB of RAM on average, but the cluster is heterogeneous. I have tried submitting a SMuFin run on 20 whole (unshared) nodes. Every run I have tried has failed after 24 hours with an out-of-memory error.
Our HPC team recommended that I use our single large node, which has 4 TB of memory, so I submitted the job there with the cpus=160 option for SMuFin. The run used 1 TB of memory, so that's fine, but it hadn't finished after 24 hours and I cannot continue using the node.
The last logging messages are:
num_targets: 24
num_targets: 24
At this point I'm a bit stuck. I cannot use 20+ nodes often, as this is a shared cluster. I saw in another post that you expected such runs to take no more than 12 hours on 28 nodes.
Can you give me any suggestions? How long would you expect this to take, and is there some way I can optimize the run?