Resubmit from scratch only a few jobs

Al Kas

Jul 10, 2015, 7:14:12 AM

to grid-c...@googlegroups.com

Hello

I have a question. Out of 2600 jobs, I get error code 65 for 10 of them. How can I "clean" and resubmit from scratch only those 10, i.e. without sending all the jobs again? The problem was an I/O issue at our site, so I guess the tarball was not sent properly to the worker node.

Can you please help?

Regards

Alexis

Max Fischer

Jul 10, 2015, 7:28:14 AM

to Al Kas, grid-c...@googlegroups.com

Hi Alexis,

if you just want to run them again, using the same submit-side setup (configuration, executable, job parameters, …), it is generally enough to increase the retry count for jobs. This will resubmit any failed jobs, leaving the successful jobs untouched.

From the command line, ``-m, --max-retry`` does the job (e.g. ``./go.py -m10 <config>`` to retry up to ten times). In the configuration, ``[jobs] max retry`` does the same.

Note that if you have based your config on someone else’s config, you may have implicitly disabled retries via ``[global] cmdargs = -m 0``, which overrides ``[jobs] max retry``.
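As a rough illustration of the pitfall above, the following sketch checks a config for a ``cmdargs`` entry that sets the retry count to zero. It is not part of grid-control: the function name and the sample config are hypothetical, and it assumes the config is plain enough for Python's ``configparser`` to read, which may not hold for configs that use more advanced syntax.

```python
import configparser


def cmdargs_disable_retries(config_text):
    """Return True if [global] cmdargs sets the retry count to 0.

    Hypothetical helper: only looks for the -m / --max-retry spellings
    mentioned in this thread.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    tokens = parser.get("global", "cmdargs", fallback="").split()
    for i, tok in enumerate(tokens):
        # Forms where option and value are a single token
        if tok in ("-m0", "--max-retry=0"):
            return True
        # Forms where the value follows as the next token
        if tok in ("-m", "--max-retry") and tokens[i + 1:i + 2] == ["0"]:
            return True
    return False


# Hypothetical config inherited from someone else
sample = """
[global]
cmdargs = -m 0

[jobs]
max retry = 10
"""

print(cmdargs_disable_retries(sample))
```

If this reports a match, the ``[jobs] max retry`` value in the same config would not take effect until the ``cmdargs`` entry is changed or removed.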

Cheers,

Max

Al Kas

Jul 10, 2015, 7:34:43 AM

to grid-c...@googlegroups.com

Dear Max

Sorry, maybe I was not clear. The failed jobs are resubmitted and then fail forever (yes, I did not specify a maximum number of retries), so this keeps going and going ;-) So the question is: how do I force a "clean" resubmission of those jobs, i.e. clean the sandbox/tarball for these jobs and resubmit them as if they had never been submitted (this is what I mean by "from scratch"), without of course losing the good/completed ones?

Thanks again

Alexis

Max Fischer

Jul 10, 2015, 8:08:07 AM

to Al Kas, grid-c...@googlegroups.com

Hi Alexis,

you might want to try the ``--init`` command line switch, then. It should recreate all input data, including the tarball.

I just tried it with stable-fixes; the job status (Success/Failed/...) is left intact.

Cheers,

Max

Al Kas

Jul 14, 2015, 10:14:58 AM

to Max Fischer, grid-c...@googlegroups.com

Hi Max

actually, this is how I was submitting in the first place. To me it looks like the failed jobs are resumed rather than re-initialized (which is what I want), even if I kill the current GUI and start a new one.

Regards

Alexis

Raphael Friese

Jul 14, 2015, 10:30:30 AM

to grid-c...@googlegroups.com

Hi Alexis,

since it's only 10 jobs, maybe just note down the job numbers and specify which jobs to care about by hand. This means starting gc using "-J id:1,4,4-8"
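If the job numbers are collected in a list, a selector string of that form could be assembled with a small helper like the one below. This is only a sketch: the function name is made up, and the only assumption about grid-control is the ``id:`` syntax with commas and ranges quoted above.

```python
def job_selector(job_ids):
    """Build an 'id:...' selector, collapsing consecutive job numbers into ranges.

    Hypothetical helper, not part of grid-control.
    """
    ids = sorted(set(job_ids))
    if not ids:
        raise ValueError("no job ids given")
    parts = []
    start = prev = ids[0]
    for n in ids[1:]:
        if n == prev + 1:
            # Still inside a consecutive run
            prev = n
            continue
        # Run ended: emit it as a single number or as start-end
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return "id:" + ",".join(parts)


print(job_selector([1, 4, 5, 6, 7, 8]))  # prints id:1,4-8
```

The resulting string would then be passed after ``-J`` as described above.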

Cheers,

Raphael

Max Fischer

Jul 15, 2015, 7:09:20 AM

to Al Kas, grid-c...@googlegroups.com

Hi Alexis,

when you run with ``--init``, the following should happen:

- GC will ask whether to sync parameters. Answer yes unless you are sure nothing changed.

- GC will ask whether to recreate the runtime/tarball. Answer yes, and GC should replace any previous tarballs with the updated version. Both new and re-submitted jobs will use the new runtime.

Does this happen in your case? What version (stable, stable-fixes, trunk) are you using?

Cheers,

Max

Al Kas

Jul 17, 2015, 5:51:55 AM

to grid-c...@googlegroups.com

Hello

Yes, this is what I do, i.e. ``python go.py -iGc gc_config.conf``

but the failed ones are still failing (even with an error 134, which is not listed at https://ekptrac.physik.uni-karlsruhe.de/trac/grid-control/wiki/ErrorCodes). I am using 469:1395 (stable). Maybe this is outdated now?

thanks again

Alexis

Max Fischer

Jul 17, 2015, 6:05:35 AM

to Al Kas, grid-c...@googlegroups.com

Hi Alexis,

an exit code above 128 means your job was killed by a signal [1]. In your case it is signal 6 (134 - 128), i.e. SIGABRT. This is likely due to an internal error in your application (such as memory allocation errors), but it can have any number of causes, including batch system policies.
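The 128+N convention can be checked quickly with Python's standard ``signal`` module. The helper name below is just for illustration, and the signal names it reports are those of the system it runs on (a POSIX system is assumed).

```python
import signal


def describe_exit_code(code):
    """Explain an exit code using the shell's 128+N convention for signals."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Number does not correspond to a signal known on this system
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(134))
print(describe_exit_code(65))
```

Exit codes of 128 or below, such as the 65 from the original question, are ordinary exit statuses and are reported as such.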

I would suggest upgrading to grid-control stable-fixes, since stable is not actively maintained. It will probably not affect your problem at hand, though.

Cheers,

Max

[1] http://www.tldp.org/LDP/abs/html/exitcodes.html#EXITCODESREF

Al Kas

Jul 17, 2015, 6:09:07 AM

to grid-c...@googlegroups.com

By the way, is there no way to delete those problematic jobs by hand from the workdir, so that when I reinitialize the GUI it will just create them from scratch? Could this work?

Thanks again

Regards
