Problems submitting large numbers of small serial jobs under a single pilot job


Scott Michael

Jan 20, 2014, 2:00:11 PM
to bigjob...@googlegroups.com
Hello,

My name is Scott Michael; I'm a Principal Analyst in the Scientific Applications group at Indiana University. I've recently been trying to use BigJob on our Quarry cluster to process a large number of astronomical image files for a research project. There are roughly 125K image files spread across dozens of directories. The binary I need to run on each file extracts all of the objects in the file that fit a certain PSF. Each image takes on average about 5 minutes of runtime.

To address this problem I wrote a BigJob submission script that asked for 128 cores and 125 hours of walltime. The Python script constructs the list of jobs by crawling through the directories and finding all of the input files; each job then calls the binary with the name of an individual input file. Unfortunately, this approach failed with an unspecified Python error: roughly 25K of the 125K files were processed before the error was thrown. My guess was that the connection to the Redis DB on the Quarry gateway was lost. So I rewrote the BigJob Python submission script to take a list of input files, divided the remaining files into 5 smaller lists, and submitted 5 pilot jobs, hoping that with a shorter total requested walltime (24 hours) the connection to the Redis DB would be more stable. However, I've received the same set of error messages for the first of my smaller pilot jobs. The error file is attached along with my latest Python submission script. Can anyone tell me where this issue is coming from and what, if anything, I can do about it?
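
The file-discovery and splitting part of the approach is roughly the sketch below (the top-level directory, file suffix, and output names are placeholders here, not the exact values in the attached script):

import os

def find_input_files(top_dir, suffix=".in"):
    # Crawl the image directories and collect every input file.
    inputs = []
    for dirpath, _dirnames, filenames in os.walk(top_dir):
        for name in filenames:
            if name.endswith(suffix):
                inputs.append(os.path.join(dirpath, name))
    return inputs

def split_into_chunks(items, n_chunks):
    # Divide the remaining file list into n_chunks roughly equal pieces,
    # one list per pilot job submission.
    return [items[i::n_chunks] for i in range(n_chunks)]

# Placeholder path; the real script points at the project's image directories.
all_files = find_input_files("/path/to/image/dirs")
for i, chunk in enumerate(split_into_chunks(all_files, 5)):
    with open("filelist_%02d.txt" % i, "w") as f:
        f.write("\n".join(chunk) + "\n")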

Thanks in advance for your assistance.

Scott Michael Ph.D.
Principal Systems Analyst
Indiana University
Scientific Applications and Performance Tuning
University Information Technology Services
scamicha [at] iu [dot] edu
dophotjob_ens01.py
stderr-bj-3444edf4-7e32-11e3-8d91-001fc6d94bec-agent.txt

Andre Merzky

Jan 20, 2014, 4:54:59 PM
to bigjob...@googlegroups.com
Doesn't the last line of the log indicate the problem?

> PBS: job killed: walltime 86441 exceeded limit 86400

I think you need to submit the pilot to a queue that allows for longer walltimes.
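
Something along these lines in the pilot description, I mean (I'm assuming the usual BigJob pilot description keys here; the queue name and paths are just examples for your site):

pilot_compute_description = {
    "service_url":         "pbs://localhost",    # submit through the local PBS
    "number_of_processes": 128,
    "queue":               "long",               # example: a queue that permits long walltimes
    "walltime":            1440,                 # in minutes; must fit within the queue's limit
    "working_directory":   "/path/to/agent/dir"  # placeholder
}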

My $0.02,

Andre.



--
Nothing is really difficult.

Scott Michael

Jan 21, 2014, 7:57:40 PM
to bigjob...@googlegroups.com

I don't think this is the issue. I'd assume that errors along the lines of:


Traceback (most recent call last):
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/bigjob/bigjob_agent.py", line 720, in start_new_job_in_thread
    if(job_dict["state"]==str(bigjob.state.Unknown)):
KeyError: 'state'


are not standard output for BigJob. At any rate, judging by the timestamps on the output files, it's fairly clear that the job stopped processing subjobs shortly after the pilot job started. In fact, looking at the list of input files to process, only 31 of the files were processed in one of the runs; I requested 128 cores for the pilot job, so it would seem at least that many subjobs should start. In addition, several hundred files named something along the lines of advert-launcher-machines-sj-f6c08b12-7e33-11e3-8d91-001fc6d94bec (where the sj number is different for every file), each listing a single node, have appeared in my $HOME. This is new behaviour; it hasn't happened with any of my previous BigJob runs.


-- Scott

Andre Luckow

Jan 21, 2014, 9:56:29 PM
to bigjob...@googlegroups.com
Hi Scott,
can you retry with the latest BigJob version (0.56), please?

If the error persists, please rerun with BIGJOB_VERBOSE set to 5 and send the logs.
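
The simplest way is to export BIGJOB_VERBOSE=5 in the shell or job script before the run; setting it at the top of the submission script should also work, as long as it happens before bigjob/pilot is imported (the level is read when the package initializes, if I remember correctly):

import os
os.environ["BIGJOB_VERBOSE"] = "5"   # set before the bigjob/pilot import

# ... then the usual imports and submission code follow, e.g.
# from pilot import PilotComputeService, ComputeDataService, State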

Thanks,
Andre

Scott Michael

Jan 29, 2014, 4:28:37 PM
to bigjob...@googlegroups.com

Hi Andre,


I've updated BigJob on the cluster to the latest version via easy_install; it's now at 0.63.3. The job still fails to run. With the logging level turned all the way up, it produces a 112 MB logfile. You can download the file here: https://iu.box.com/s/3611nik4aoop686vbrn9


Thanks for your help!


-- Scott

Andre Luckow

Jan 29, 2014, 8:56:58 PM
to bigjob...@googlegroups.com
Hi Scott,
thanks. Some strange CU description ended up in Redis...

instead of:
{'Executable': '/N/dc2/projects/BDBS/cijohnson//dorunner.sh',
 'WorkingDirectory': '/N/dc2/projects/BDBS/cijohnson//./lp4.0058bm6.8292',
 'NumberOfProcesses': '1',
 'start_time': '1390836474.31',
 'Environment': "['TASK_NO=4687']",
 'state': 'Unknown',
 'Arguments': "['/N/dc2/projects/BDBS/cijohnson/./lp4.0058bm6.8292 tu1783717_58.in\\n']",
 'Error': 'tu1783717_58.err',
 'Output': 'tu1783717_58.out',
 'job-id': 'sj-976b4976-8767-11e3-adde-001fc6d94bec',
 'SPMDVariation': 'single'}

this:
{'a': 'd', 'c': '-', 'b': 'e', 'd': 'e', 'f': 'c', '-': '0', '3': '-',
 '1': 'e', '0': '1', 's': 'j', '7': 'f', '6': 'd', '9': '4', '8': '7'}
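
Until we track down where that entry comes from, a defensive check before the agent touches job_dict["state"] would at least keep the thread pool worker from dying on such records. A rough sketch only; the names are illustrative and not the exact agent code:

REQUIRED_KEYS = ("Executable", "state", "NumberOfProcesses")

def is_valid_cu_description(job_dict):
    # Reject descriptions read back from Redis that are missing the expected
    # fields (for example, the garbled character-by-character dict above).
    return all(key in job_dict for key in REQUIRED_KEYS)

# before the state check in start_new_job_in_thread, something like:
# if not is_valid_cu_description(job_dict):
#     logger.error("Skipping malformed CU description: %s" % job_dict)
#     return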

Can you please also send me your master log in DEBUG mode?

Thanks,
Andre

Scott Michael

Jan 30, 2014, 11:06:19 AM
to bigjob...@googlegroups.com
Hi Andre,

Sorry, which file is the master log? The only other BigJob output I can find is the stdout file. It is attached.

--Scott
stdout-bj-3e24b83e-8767-11e3-adde-001fc6d94bec-agent.txt

Shantenu Jha

Feb 10, 2014, 7:46:57 PM
to scam...@gmail.com, bigjob...@googlegroups.com
Hi Scott,

Are you up and running, or do you still have trouble?

Please let me know.

Thanks
Shantenu

Ole Weidner

Feb 18, 2014, 11:23:48 AM
to bigjob...@googlegroups.com
Hi AndreL,

can you have a look at this, please? The entry below looks roughly like a UUID string that was accidentally written as a dict, but I am not sure how to debug this.

I have opened a ticket for this: https://github.com/saga-project/BigJob/issues/176

Thanks!
Ole


<stdout-bj-3e24b83e-8767-11e3-adde-001fc6d94bec-agent.txt>

signature.asc