I don't think this is the issue. Am I right in assuming that errors along the lines of:
Traceback (most recent call last):
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/bigjob/bigjob_agent.py", line 720, in start_new_job_in_thread
    if(job_dict["state"]==str(bigjob.state.Unknown)):
KeyError: 'state'
are not standard output for BigJob? At any rate, looking at the output files by timestamp, it's fairly clear that the job stopped processing subjobs shortly after the pilot job started. In fact, looking at the list of input files to process, only 31 of the files were processed in one of the runs. I've requested 128 cores for the pilot job, so it would seem at least that many subjobs should start. In addition, several hundred files named something along the lines of advert-launcher-machines-sj-f6c08b12-7e33-11e3-8d91-001fc6d94bec (where the sj identifier is different for every file), each with a single node listed in it, have appeared in my $HOME. This is new behaviour; it hasn't happened with any of my previous BigJob runs.
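For what it's worth, here is a rough sketch of the kind of guard I would expect around that line in bigjob_agent.py; the surrounding logic and the state comparison are guesses based only on the traceback, since I don't know the agent internals:

import logging

logger = logging.getLogger("bigjob")

def start_new_job_in_thread(job_dict):
    # Hypothetical guard: the backend entry for a subjob may not have a
    # 'state' field yet when the agent thread picks it up.
    state = job_dict.get("state")
    if state is None:
        logger.warning("subjob entry has no 'state' field yet: %s", job_dict)
        return
    if state == "Unknown":  # stand-in for str(bigjob.state.Unknown)
        pass  # the original launch logic would go here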
-- Scott
Hi Andre,
I've updated BigJob on the cluster to the latest version via easy_install; it's now at 0.63.3. The job still fails to run. With the logging level turned all the way up, it produces a 112 MB logfile. You can download the file here: https://iu.box.com/s/3611nik4aoop686vbrn9
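In case it saves wading through all 112 MB by hand, this is roughly how I would whittle the agent output down; the filename is the agent stdout file and the substrings come from the traceback in my earlier mail, so adjust them to whatever the log actually contains:

from collections import Counter

counts = Counter()
with open("stdout-bj-3e24b83e-8767-11e3-adde-001fc6d94bec-agent.txt") as log:
    for line in log:
        # Count how many threads reached the launch code versus how many
        # crashed on the missing 'state' key.
        if "start_new_job_in_thread" in line:
            counts["launch attempts"] += 1
        if "KeyError: 'state'" in line:
            counts["keyerror"] += 1

print(counts)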
Thanks for your help!
-- Scott
--
<stdout-bj-3e24b83e-8767-11e3-adde-001fc6d94bec-agent.txt>