slurmstepd: Step 7825789.0 exceeded memory limit (5250280 > 5242880), being killed
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
slurmstepd: *** STEP 7825789.0 ON comet-06-11 CANCELLED AT 2017-02-23T18:31:12 ***
/projects/ps-ngbt/home/cipres/ngbw/contrib/tools/bin/beast2wrapper_2.4.4: line 67: 29954 Killed $cmdline
slurmstepd: Exceeded job memory limit at some point.
srun: error: comet-06-11: task 0: Killed
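A quick reading of the numbers in the first log line. This assumes the two figures are in KiB, which is how SLURM's memory-enforcement messages usually report them (worth confirming against the SLURM version on Comet). Under that assumption the limit is exactly 5 GiB, and the step was killed for going over it by only about 7 MiB:

```python
# Values copied from the slurmstepd message; assumed to be in KiB.
used_kib = 5250280
limit_kib = 5242880

KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 * 1024

limit_gib = limit_kib / KIB_PER_GIB               # size of the limit in GiB
over_mib = (used_kib - limit_kib) / KIB_PER_MIB   # how far past the limit the step went

print(f"limit: {limit_gib:.2f} GiB")
print(f"overshoot: {over_mib:.1f} MiB")
```

This prints "limit: 5.00 GiB" and "overshoot: 7.2 MiB", so the run was only slightly over the per-core cap.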
Hi Brian,
We are new to Starbeast at CIPRES, so we would benefit from doing some benchmarking. We just haven't had a chance to do it yet.
5 GB is typically the maximum for one-core jobs on Comet. Once I have more information, I can adjust the interface to accommodate this issue. In the meantime, you can try tricking the interface into giving you a multicore run, which will bring more memory. I'm not sure the configuration will work out, since I don't know Starbeast well yet. Try telling the interface you have 1 partition and 6,000 patterns. That should start your run on 6 cores, which will give it more memory.
Also, can you please send me the file _jobinfo.txt for the run that ran out of memory?
I can use the information to recover the configuration you used, and in benchmarking experiments.
Mark
--
You received this message because you are subscribed to the Google Groups "beast-users" group.
Visit this group at https://groups.google.com/group/beast-users.
Task\ label=noArg1strictEstSubs.Run1
Task\ ID=1113536
Tool=BEAST2_XSEDE
created\ on=2017-02-23 12:22:03.0
JobHandle=NGBW-JOB-BEAST2_XSEDE-E48A8B0E062B432791B373A1CEDF9D78
resource=comet
User\ ID=6812
User\ Name=bdorsey
Output=(all_results,*,UNKNOWN,UNKNOWN,UNKNOWN)
ChargeFactor=1.000000
cores=1
JOBID=7825794
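The job info is a flat list of key=value pairs with spaces in keys escaped as a backslash followed by a space, which looks like Java-style properties syntax. Assuming the original file holds one pair per line (the email flattened it), a minimal Python sketch for pulling out fields such as "cores" could look like this. "parse_jobinfo" is a hypothetical helper for illustration, not part of CIPRES:

```python
import re


def parse_jobinfo(text: str) -> dict[str, str]:
    """Parse properties-style 'key=value' lines where spaces in keys are escaped as '\\ '."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and properties-style comments
        # Split on the first '=' that is not preceded by a backslash escape.
        match = re.match(r"((?:\\.|[^=\\])*)=(.*)", line)
        if match is None:
            continue  # not a key=value line
        key = match.group(1).replace("\\ ", " ")  # unescape spaces in the key
        info[key] = match.group(2)
    return info


# Usage with a few fields copied from the job info above.
sample = "\n".join([
    r"Task\ label=noArg1strictEstSubs.Run1",
    "Tool=BEAST2_XSEDE",
    "resource=comet",
    "cores=1",
    "JOBID=7825794",
])
info = parse_jobinfo(sample)
print(info["Task label"], info["cores"])
```

Reading "cores=1" this way confirms that this particular run was submitted as a single-core job.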
Fair enough, my eye turned the M into a G. The Comet machine has 128 GB per 24 cores, and each core gets its own share of that, which should be somewhere around 5 GB. If you are running in the shared queue, which you are, there is a possibility that one of the other jobs running alongside yours will compete for your memory and cause this kind of issue. But I wouldn't expect that to happen on all 4 jobs.
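As a rough check of the per-core figure, assume memory in the shared queue is handed out in proportion to the cores requested (an assumption, not something stated in this thread). Under that assumption, 128 GB over 24 cores is about 5.3 GB per core, and a 6-core run would get roughly 32 GB:

```python
NODE_MEMORY_GB = 128  # memory per Comet node, as quoted above
CORES_PER_NODE = 24


def memory_for_cores(cores: int) -> float:
    """Approximate memory (GB) for a job, assuming a proportional per-core share."""
    per_core = NODE_MEMORY_GB / CORES_PER_NODE
    return cores * per_core


for cores in (1, 6):
    print(f"{cores} core(s): ~{memory_for_cores(cores):.1f} GB")
```

The limit actually enforced on the 1-core run (5 GiB in the log) is a little below this raw share, so treat these numbers as upper estimates.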
There are some experiments we can try to figure this out. It will take a couple of days.
From: beast...@googlegroups.com [mailto:beast...@googlegroups.com] On Behalf Of Brian D.
Sent: Saturday, February 25, 2017 2:13 PM
To: beast-users <beast...@googlegroups.com>