Hi
I recently ran a CuffDiff job that would run for many hours and then get killed.
My jobs ran on a small cluster where each node has 24 GB of RAM, 4 GB of swap, and dual quad-core CPUs.
The jobs were killed by the system because they appeared to be asking for more swap space than a single node had available.
Accounting information for the job:
CPU_T      WAIT  TURNAROUND  STATUS  HOG_FACTOR  MEM     SWAP
169725.98  2     64381       exit    2.6363      12974M  14480M
I then tried to get the LSF scheduler to distribute the job across enough nodes to provide aggregate swap space in excess of the 14 GB it was requesting.
I did this by asking LSF for 8 nodes (each a dual quad-core unit with 24 GB of RAM), assuming I would then have 32 GB of aggregate swap (8 nodes x 4 GB swap per node).
CuffDiff was asked to use 8 threads (-p 8).
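For reference, the submission was along these lines (the GTF and BAM names are placeholders for my actual files):

  bsub -n 8 -o cuffdiff.%J.log \
       cuffdiff -p 8 -o diff_out merged.gtf cond1.bam cond2.bam

(I may be wrong in assuming that -n 8 means 8 nodes rather than 8 slots.)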
The job always started on one node, stayed there, and crashed.
I then used LSF to restrict the job to a single process per node; once again the job started on one node, stayed there, and crashed. The resource requirement I used was along these lines (same placeholder file names as above):
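  bsub -n 8 -R "span[ptile=1]" -o cuffdiff.%J.log \
       cuffdiff -p 8 -o diff_out merged.gtf cond1.bam cond2.bam

My understanding is that span[ptile=1] places at most one slot per host, so the 8 slots should have been spread across 8 nodes.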
The accounting information for these jobs indicated that several processes (4) had been started, as well as a number of threads (14), e.g.:
Resource usage summary:
CPU time : 169725.98 sec.
Max Memory : 12974 MB
Max Swap : 14480 MB
Max Processes : 4
Max Threads : 14
So why, if the thread count is set to 8, is the program launching 14 threads and 4 processes?
If I restrict the number of processes per node to one, why does the program stay on one node?
Is it possible to get CuffDiff to run on more than a single node?
Starr