Runtime & cluster size


skill...@gmail.com

Jul 1, 2015, 6:08:08 AM
to smu...@googlegroups.com
I have a question regarding the cluster size/set up and the runtime.

I have a 90X WGS tumor/normal pair that I have been trying to run. Our cluster nodes have 48GB RAM on average, but the cluster is heterogeneous. I have tried submitting a SMuFin run on 20 whole nodes (unshared). After 24 hours, every run I have tried has failed with an out-of-memory error.

Our HPC team recommended that I use our single large node with 4TB of memory, submitted with the cpus=160 option for SMuFin. It used 1TB of memory, so that part is fine, but after 24 hours it had not finished and I cannot keep using the node.

The last logging messages are:

num_targets: 24
num_targets: 24

At this point I'm a bit stuck. I cannot often use 20+ nodes, as this is a shared cluster. I saw in another post that you expected such runs to take no more than 12 hours on 28 nodes.

Can you give me any suggestions? How long would you expect this to take or is there some way I can optimize the run?

montse

Jul 7, 2015, 5:38:12 AM
to smu...@googlegroups.com
Can you tell me how you are running SMuFin? What command line are you using?

Thank you,
SMuFin

skill...@gmail.com

Jul 7, 2015, 9:55:35 AM
to smu...@googlegroups.com
We’ve now run it in two different contexts. In the first, we used 20 nodes from a heterogeneous cluster with 24GB-64GB of memory per node and 2-4 CPUs with 12 threads per CPU (according to the admin). In this setup, one of the nodes would throw an out-of-memory error and the whole run would fail after 12-20 hours. Note that this is a shared cluster I access when possible, and I cannot regularly use 20 full nodes.

mpirun -np 12 -hostfile $OAR_NODEFILE ./SMuFin … -cpus_per_node 12

Based on debugging statements we added to main.cpp, this run fails before it finishes reading in the FASTQ files.
__________________________________

In the second context, we used a single large node with 4TB of memory and 160 threads in total. Here we saw it use 1TB of memory (we tried this three times):

mpirun -np 16 -hostfile $OAR_NODEFILE ./SMuFin … -cpus_per_node 160

This one reads in all of the FASTQ files, but after 20 hours it would fail with:

Fatal error in MPI_Send: Invalid count, error stack:
MPI_Send(196): MPI_Send(buf=0x7ff05a3b5010, count=-1063321726, MPI_CHAR, dest=6, tag=10, MPI_COMM_WORLD) failed
MPI_Send(113): Negative count, value is -1063321726