Hi all,
Can someone guide me on how gearman does retries when exceptions are thrown or when errors occur?
I use the python gearman client in a Django app and my workers are initiated as a Django command. I read from this blog post that retries from error conditions are not straight forward and that it requires sys.exit from the worker side.
Has this been fixed to retry perhaps with sendFail or sendException? Also does gearman support retries with exponentials algorithm – say if an SMTP failure happens its retries after 2,4,8,16 seconds etc?
Help much appreciated!
What if one client fails for a reason that is beyond his control? Isn’t this one the reasons gearman is made for? If one client fails due to low disk space then there is not point for it to retry, right? Gearman in that case should assign to another worker. In my case a gearman worker’s task is to send mail from the local sendmail interface. In many cases the queue will be very long and I don’t want it to delay. So I should be able to tell gearman that I failed and you should reassign it to someone else.
I was looking for the correct way for a client to tell the server. Listen, I failed my task. Now you take over and do what’s best.
This blog post list error cases and retries. But that was a long time ago.
-Cherian
As it stands, the default behavior will be to mark the job as
complete, whether it failed or not.
I created a custom sub-class of gearman worker that re-queues the
task. The benefit to this approach is that the failed task ends up at
the end of the queue, just in case it's a 'poison' task that always
breaks the worker due to a bug. I've found that in practice it's
really easy to get in a situation where your gearman workers just spin
on failed task because of a bug you introduced. Having some sort of
exponential fall off support within gearmand would be awesome, but I
don't think it exists. I was considering adding some worker-level code
to handle limited retries and maybe causing the worker itself to
throttle if it saw the same failure too many times.
My code looks something like this: https://gist.github.com/1709099
Rhett
Rhett
If its a sync job, the client should receive the failure and retry
the job. If its an async job, the worker should add the job back to
the queue before returning an answer.
In your case, perhaps the worker should stop asking for jobs to do and
poll the mail queue length until it gets acceptibly short. This would
have the effect of pushing the jobs to other workers, and if all of the
queues are too long, pushing the queue into gearmand itself.
The problem with having gearmand do this is that you lose any
ability to apply real logic to the question of whether or not to even
retry. Sometimes no answer is better than waiting for the right answer.
(Btw, please don't top post, reply inline next time.)
I was just dealing with this issue the other day. The python library
definitely doesn't give you as much flexibility around this as I would
like.
As it stands, the default behavior will be to mark the job as
complete, whether it failed or not.
I created a custom sub-class of gearman worker that re-queues the
task. The benefit to this approach is that the failed task ends up at
the end of the queue, just in case it's a 'poison' task that always
breaks the worker due to a bug. I've found that in practice it's
really easy to get in a situation where your gearman workers just spin
on failed task because of a bug you introduced. Having some sort of
exponential fall off support within gearmand would be awesome, but I
don't think it exists. I was considering adding some worker-level code
to handle limited retries and maybe causing the worker itself to
throttle if it saw the same failure too many times.
My code looks something like this: https://gist.github.com/1709099
Rhett,
Thanks for the heads up.
This code looks good for an approach. What I am most concerned is the spins. What if you have only one worker? Have you considered maintaining a state?
Gearman offers a “--job-retries=RETRIES”. Have you considered that? I am not yet sure what classifies for a retry by gearman. Is there an error code we should be passing?
Excerpts from Cherian Thomas's message of Mon Jan 30 20:06:10 -0800 2012:
> I was looking for the correct way for a client to tell the server. Listen,> This blog post<http://www.hermanradtke.com/blog/retrying-failed-gearman-jobs/>list
> I failed my task. Now you take over and do what’s best.
>
> error cases and retries. But that was a long time ago.
>
> -Cherian
>
> On Mon, Jan 30, 2012 at 5:21 PM, Faustino Olpindo <folp...@gmail.com>wrote:
>
> > Hi,
> >
> > I never used python with my application but with our application the
> > client is the one that initiates retries in case of failure.
> >
> >
> >
> What if one client fails for a reason that is beyond his control? Isn’tIf its a sync job, the client should receive the failure and retry
> this one the reasons gearman is made for? If one client fails due to low
> disk space then there is not point for it to retry, right? Gearman in that
> case should assign to another worker. In my case a gearman worker’s task is
> to send mail from the local sendmail interface. In many cases the queue
> will be very long and I don’t want it to delay. So I should be able to tell
> gearman that I failed and you should reassign it to someone else.
>
the job. If its an async job, the worker should add the job back to
the queue before returning an answer.
-----Original Message-----
From: Cherian Thomas
Sent: Tue, 31 Jan 2012 20:44:19 +0530
To: gearman
Subject: Re: Error conditions and retries in gearman?
On Tue, Jan 31, 2012 at 12:00 PM, Clint Byrum <cl...@fewbar.com
<mailto:cl...@fewbar.com>> wrote:
Excerpts from Cherian Thomas's message of Mon Jan 30 20:06:10 -0800
2012:
> I was looking for the correct way for a client to tell the
server. Listen,
> I failed my task. Now you take over and do what’s best.
>
> This blog
post<http://www.hermanradtke.com/blog/retrying-failed-gearman-jobs/>list
> error cases and retries. But that was a long time ago.
>
> -Cherian
>
> On Mon, Jan 30, 2012 at 5:21 PM, Faustino Olpindo
<folp...@gmail.com <mailto:folp...@gmail.com>>wrote:
Yes, you'd make your gearman payload a serialized object with the real payload
and the number of retries. So something like
def handle_job(job):
payload = json_decode(job.payload())
obj=payload['object']
retries=payload['retries']
try:
...
except:
if retries > 5:
log_and_drop_job(obj)
else:
payload = {'object'=>obj, 'retries'=>retries+1}
gearman_client.do_background('job',json_encode(payload))
This is a catastrophic retry, if your worker disappears without ever
returning anything. Any response to the job will result in the job being
removed from the queue.
I am getting greedy now :-)
Is it possible to introduce a delay from the server end? Perhaps the exponential back off algorithm?
You'd have to do the delay in the worker somehow, which is suboptimal.
There's an open feature request for having the ability to delay a work
item being handed out, which is part of the gearman protocol but has
never been re-implemented in the port from perl to C:
https://bugs.launchpad.net/gearmand/+bug/686757
And a branch done by me that needs some testing:
https://code.launchpad.net/~clint-fewbar/gearmand/use-worker-timeout
Thanks a lot Clint and Rhett. This is lot of insight.
For the sleep functionality rather that merge Clint’s branch I’ll go with a hybrid approach that you guys have shown
i.e in the
def on_job_exception(self, current_job, exc_info):
def handle_job(job):
payload = json_decode(job.payload())
obj=payload['object']
retries=payload['retries']
time.sleep(2*retries)
This will give it an exponential delay, though doing this in the client is bad.
Let me come back with pseudo code in a day or two.
-Cherian
You can also just make use of an external scheduler... I used the unix
command 'at' to do this in one implementation.. just dumping the payload
into a text file and scheduling its re-insertion after the delay.
Thanks Clint.
This helps, although I don’t want to intermix bash and python. It sometimes gets in the way. My approach would be to take epoch and schedule for that rather than “at” since i don't want to sleep for hours or days. But this is a new learning.
-Cherian
Hi Clint & Rhett,
This might sound a bit stupid, but correct me…
While I was writing up the code based on your advice, I stumbled upon a problem for which I haven’t found an answer yet .
So to retry the job in an exponentially delayed fashion I went ahead can changed django_gearman code as follows
class RetryJobPayload(object):
def __init__(self,data,retry):
self.retry = retry
self.data = data
def __unicode__(self):
return "<RetryJobPayload %d %s>" % (self.retry, self.data)
class DjangoGearmanWorker(GearmanWorker):
"""
gearman worker, automatically connecting to server and discovering
available jobs
"""
data_encoder = PickleDataEncoder
def on_job_execute(self, current_job):
if isinstance(current_job.data,RetryJobPayload):
current_job.data = current_job.data.data
print "Job started"
return super(DjangoGearmanWorker, self).on_job_execute(current_job)
def on_job_exception(self, current_job, exc_info):
retry = 0
if isinstance(current_job.data, RetryJobPayload):
retry = current_job.data.retry
current_job.data = current_job.data.data
else:
print "original"
client_retries = int(getattr(settings,'GEARMAN_CLIENT_RETRIES',1))
if retry <= client_retries:
retry += 1
LOG.error("Gearman job %s failed. Attempting Retry after %d seconds."%(current_job.task,2*retry))
time.sleep( 2 * retry )
current_job.data = RetryJobPayload(current_job.data,retry)
print "sending ... " + str(current_job.data)
try:
gm_client = DjangoGearmanClient()
gm_client.submit_job(task = current_job.task,
data = current_job.data,
wait_until_complete = False,
max_retries = 5)
except Exception as e:
LOG.exception(str(e))
# Remember, u still have to exist since we have spawned a new job and not reusing the existing one
print "Job failed"
return super(DjangoGearmanWorker, self).on_job_exception(current_job, exc_info)
Now the problem I am facing is to encode the retry counter into the payload.
I haven’t really found out a way to do this since on the job execution I am sanitizing the payload to remove the retry counter since the method that needs to be called to do the task needs the data or payload of the type that it is excepting
e.g.
@gearman_job
def image_push(data,*args,**kwargs):
assert isinstance(data,JPEGImage), "data should be instance of JPEGImage"
Effectively I can’t increment the counter , since a new job a spawned on every failure and a state cant be maintained.
Do you have any thoughts? Help?
Glad to see you're getting closer to a solution.
When I was doing this, in PHP, I made an abstract class that would be extended
by any workers. In python the equivalent would be:
class BaseWorker(object):
def __init__(self, gearman_server):
self._job_name=None
self._server=gearman_server
def _register(self):
return self._server.add_function(self._job_name, self._run)
@classmethod
def _run(self, data):
real_payload = json_decode(data)
real_data=real_payload['data']
retry_counter=real_payload['retries']
result = self.run(real_data)
if result:
return result
else:
if retry_counter < RETRY_THRESHOLD:
real_payload['retries'] += 1
self._server.do_background(self._job_name, json_encode(real_payload))
else:
log_failure(real_data)
return result
I'm not sure if that will actually work in python, as its object model
is a bit different than PHP's, but you get the idea, right? Just have
workers that set self._job_name and implement run() and let thise
BaseWorker implement the retrying stuff.
Thanks Clint. That was a deal of insight.
Thankfully django_gearman has a decorator pattern that made it easy for me to fix.
Now for the patch to the community…
I am still in for the server side solution. This worker based retry is a huge risk, especially if the you restart the worker during a build process and it’s in the middle of a sleep.
Well if the sleep is done between the server giving you the job, and
you responding with a result, the server *will* retry on its own, as
that is a catastrophic job failure.
As far as the back-off/delay.. I just tried merging my old branch and
things have changed quite drastically in gearmand since I did the work,
so it will require some reworking.
Simpler solution is to just persist retries someplace else.