Error conditions and retries in gearman?

646 views
Skip to first unread message

Cherian Thomas

unread,
Jan 12, 2012, 6:14:39 AM1/12/12
to gea...@googlegroups.com

Hi all,

Can someone guide me on how gearman does retries when exceptions are thrown or when errors occur?

I use the python gearman client in a Django app and my workers are initiated as a Django command. I read from this blog post that retries from error conditions are not straight forward and that it requires sys.exit from the worker side.

Has this been fixed to retry perhaps with sendFail or sendException? Also does gearman support retries with exponentials algorithm – say if an SMTP failure happens its retries after 2,4,8,16 seconds etc?

Help much appreciated!


Regards,
Cherian

Cherian

unread,
Jan 28, 2012, 10:03:07 AM1/28/12
to gearman
Hi all,
Can someone help me with this?
-Cherian

On Jan 12, 4:14 pm, Cherian Thomas <cherian...@gmail.com> wrote:
> Hi all,
>
> Can someone guide me on how gearman does retries when exceptions are thrown
> or when errors occur?
>
> I use the python gearman client <http://pypi.python.org/pypi/gearman/> in a
> Django app and my workers are initiated as a Django command. I read from this
> blog post <http://www.hermanradtke.com/blog/retrying-failed-gearman-jobs/>that

Faustino Olpindo

unread,
Jan 30, 2012, 6:51:38 AM1/30/12
to gea...@googlegroups.com
Hi,

I never used python with my application but with our application the client is the one that initiates retries in case of failure.

Cherian Thomas

unread,
Jan 30, 2012, 11:06:10 PM1/30/12
to gea...@googlegroups.com

What if one client fails for a reason that is beyond his control? Isn’t this one the reasons gearman is made for? If one client fails due to low disk space then there is not point for it to retry, right? Gearman in that case should assign to another worker. In my case a gearman worker’s task is to send mail from the local sendmail interface. In many cases the queue will be very long and I don’t want it to delay. So I should be able to tell gearman that I failed and you should reassign it to someone else.

I was looking for the correct way for a client to tell the server. Listen, I failed my task. Now you take over and do what’s best.

This blog post list error cases and retries. But that was a long time ago.

-Cherian

Rhett Garber

unread,
Jan 31, 2012, 12:59:37 AM1/31/12
to gea...@googlegroups.com
I was just dealing with this issue the other day. The python library
definitely doesn't give you as much flexibility around this as I would
like.

As it stands, the default behavior will be to mark the job as
complete, whether it failed or not.

I created a custom sub-class of gearman worker that re-queues the
task. The benefit to this approach is that the failed task ends up at
the end of the queue, just in case it's a 'poison' task that always
breaks the worker due to a bug. I've found that in practice it's
really easy to get in a situation where your gearman workers just spin
on failed task because of a bug you introduced. Having some sort of
exponential fall off support within gearmand would be awesome, but I
don't think it exists. I was considering adding some worker-level code
to handle limited retries and maybe causing the worker itself to
throttle if it saw the same failure too many times.

My code looks something like this: https://gist.github.com/1709099

Rhett

Rhett Garber

unread,
Jan 31, 2012, 1:02:37 AM1/31/12
to gea...@googlegroups.com
Oh I should clarify that I'm only talking about background tasks here,
the only kind I ever use.

Rhett

Clint Byrum

unread,
Jan 31, 2012, 1:30:11 AM1/31/12
to gearman
Excerpts from Cherian Thomas's message of Mon Jan 30 20:06:10 -0800 2012:

> I was looking for the correct way for a client to tell the server. Listen,
> I failed my task. Now you take over and do what’s best.
>
> This blog post<http://www.hermanradtke.com/blog/retrying-failed-gearman-jobs/>list

> error cases and retries. But that was a long time ago.
>
> -Cherian
>
> On Mon, Jan 30, 2012 at 5:21 PM, Faustino Olpindo <folp...@gmail.com>wrote:
>
> > Hi,
> >
> > I never used python with my application but with our application the
> > client is the one that initiates retries in case of failure.
> >
> >
> >
> What if one client fails for a reason that is beyond his control? Isn’t
> this one the reasons gearman is made for? If one client fails due to low
> disk space then there is not point for it to retry, right? Gearman in that
> case should assign to another worker. In my case a gearman worker’s task is
> to send mail from the local sendmail interface. In many cases the queue
> will be very long and I don’t want it to delay. So I should be able to tell
> gearman that I failed and you should reassign it to someone else.
>

If its a sync job, the client should receive the failure and retry
the job. If its an async job, the worker should add the job back to
the queue before returning an answer.

In your case, perhaps the worker should stop asking for jobs to do and
poll the mail queue length until it gets acceptibly short. This would
have the effect of pushing the jobs to other workers, and if all of the
queues are too long, pushing the queue into gearmand itself.

The problem with having gearmand do this is that you lose any
ability to apply real logic to the question of whether or not to even
retry. Sometimes no answer is better than waiting for the right answer.

(Btw, please don't top post, reply inline next time.)

Cherian Thomas

unread,
Jan 31, 2012, 10:11:36 AM1/31/12
to gea...@googlegroups.com
On Tue, Jan 31, 2012 at 11:29 AM, Rhett Garber <rhe...@gmail.com> wrote:
I was just dealing with this issue the other day. The python library
definitely doesn't give you as much flexibility around this as I would
like.

As it stands, the default behavior will be to mark the job as
complete, whether it failed or not.

I created a custom sub-class of gearman worker that re-queues the
task. The benefit to this approach is that the failed task ends up at
the end of the queue, just in case it's a 'poison' task that always
breaks the worker due to a bug. I've found that in practice it's
really easy to get in a situation where your gearman workers just spin
on failed task because of a bug you introduced. Having some sort of
exponential fall off support within gearmand would be awesome, but I
don't think it exists. I was considering adding some worker-level code
to handle limited retries and maybe causing the worker itself to
throttle if it saw the same failure too many times.

My code looks something like this: https://gist.github.com/1709099

Rhett,

Thanks for the heads up.

This code looks good for an approach. What I am most concerned is the spins. What if you have only one worker? Have you considered maintaining a state?

Gearman offers a “--job-retries=RETRIES”. Have you considered that? I am not yet sure what classifies for a retry by gearman. Is there an error code we should be passing?

-Cherian 

Cherian Thomas

unread,
Jan 31, 2012, 10:14:19 AM1/31/12
to gea...@googlegroups.com
On Tue, Jan 31, 2012 at 12:00 PM, Clint Byrum <cl...@fewbar.com> wrote:
Excerpts from Cherian Thomas's message of Mon Jan 30 20:06:10 -0800 2012:
> I was looking for the correct way for a client to tell the server. Listen,
> I failed my task. Now you take over and do what’s best.
>
> This blog post<http://www.hermanradtke.com/blog/retrying-failed-gearman-jobs/>list
> error cases and retries. But that was a long time ago.
>
> -Cherian
>
> On Mon, Jan 30, 2012 at 5:21 PM, Faustino Olpindo <folp...@gmail.com>wrote:
>
> > Hi,
> >
> > I never used python with my application but with our application the
> > client is the one that initiates retries in case of failure.
> >
> >
> >
> What if one client fails for a reason that is beyond his control? Isn’t
> this one the reasons gearman is made for? If one client fails due to low
> disk space then there is not point for it to retry, right? Gearman in that
> case should assign to another worker. In my case a gearman worker’s task is
> to send mail from the local sendmail interface. In many cases the queue
> will be very long and I don’t want it to delay. So I should be able to tell
> gearman that I failed and you should reassign it to someone else.
>

If its a sync job, the client should receive the failure and retry
the job.  If its an async job, the worker should add the job back to
the queue before returning an answer.
 
If this is an async job and I add it back to the queue, is there a way to maintain state and add it back only “n” number of times? 

Keith Kruepke

unread,
Jan 31, 2012, 10:30:42 AM1/31/12
to gea...@googlegroups.com
You could put a count in your payload. Then increment or decrement the
count every time the same job is resubmitted.

-----Original Message-----
From: Cherian Thomas
Sent: Tue, 31 Jan 2012 20:44:19 +0530
To: gearman
Subject: Re: Error conditions and retries in gearman?

On Tue, Jan 31, 2012 at 12:00 PM, Clint Byrum <cl...@fewbar.com
<mailto:cl...@fewbar.com>> wrote:

Excerpts from Cherian Thomas's message of Mon Jan 30 20:06:10 -0800
2012:
> I was looking for the correct way for a client to tell the
server. Listen,
> I failed my task. Now you take over and do what’s best.
>
> This blog
post<http://www.hermanradtke.com/blog/retrying-failed-gearman-jobs/>list
> error cases and retries. But that was a long time ago.
>
> -Cherian
>
> On Mon, Jan 30, 2012 at 5:21 PM, Faustino Olpindo

<folp...@gmail.com <mailto:folp...@gmail.com>>wrote:

Clint Byrum

unread,
Jan 31, 2012, 10:35:58 AM1/31/12
to gearman
Excerpts from Cherian Thomas's message of Tue Jan 31 07:14:19 -0800 2012:

Yes, you'd make your gearman payload a serialized object with the real payload
and the number of retries. So something like

def handle_job(job):
payload = json_decode(job.payload())
obj=payload['object']
retries=payload['retries']
try:
...
except:
if retries > 5:
log_and_drop_job(obj)
else:
payload = {'object'=>obj, 'retries'=>retries+1}
gearman_client.do_background('job',json_encode(payload))

Clint Byrum

unread,
Jan 31, 2012, 10:36:54 AM1/31/12
to gearman
Excerpts from Cherian Thomas's message of Tue Jan 31 07:11:36 -0800 2012:

This is a catastrophic retry, if your worker disappears without ever
returning anything. Any response to the job will result in the job being
removed from the queue.

Cherian Thomas

unread,
Jan 31, 2012, 10:42:58 AM1/31/12
to gea...@googlegroups.com

I am getting greedy now :-)

Is it possible to introduce a delay from the server end? Perhaps the exponential back off algorithm?

A real world example of this would be your SMTP not responding. Retrying immediately is futile. 2 sec, 4 sec, 8 sec etc would be a good approach. 

Clint Byrum

unread,
Jan 31, 2012, 3:33:11 PM1/31/12
to gearman
Excerpts from Cherian Thomas's message of Tue Jan 31 07:42:58 -0800 2012:

You'd have to do the delay in the worker somehow, which is suboptimal.

There's an open feature request for having the ability to delay a work
item being handed out, which is part of the gearman protocol but has
never been re-implemented in the port from perl to C:

https://bugs.launchpad.net/gearmand/+bug/686757

And a branch done by me that needs some testing:

https://code.launchpad.net/~clint-fewbar/gearmand/use-worker-timeout

Cherian Thomas

unread,
Jan 31, 2012, 10:31:57 PM1/31/12
to gea...@googlegroups.com

Thanks a lot Clint and Rhett. This is lot of insight.

For the sleep functionality rather that merge Clint’s branch I’ll go with a hybrid approach that you  guys have shown

i.e in the


def on_job_exception(self, current_job, exc_info):

def handle_job(job):
   payload = json_decode(job.payload())
   obj=payload['object']
   retries=payload['retries']

time.sleep(2*retries)


This will give it an exponential delay, though doing this in the client is bad.

Let me come back with pseudo code in a day or two.

-Cherian 


Clint Byrum

unread,
Jan 31, 2012, 11:27:40 PM1/31/12
to gearman
Excerpts from Cherian Thomas's message of Tue Jan 31 19:31:57 -0800 2012:

You can also just make use of an external scheduler... I used the unix
command 'at' to do this in one implementation.. just dumping the payload
into a text file and scheduling its re-insertion after the delay.

Cherian Thomas

unread,
Feb 1, 2012, 12:11:07 AM2/1/12
to gea...@googlegroups.com

Thanks Clint.

This helps, although I don’t want to intermix bash and python. It sometimes gets in the way. My approach would be to take epoch and schedule for that rather than “at” since i don't want to sleep for hours or days. But this is a new learning.

-Cherian

Cherian Thomas

unread,
Feb 3, 2012, 11:04:47 AM2/3/12
to gea...@googlegroups.com

Hi Clint & Rhett,

This might sound a bit stupid, but correct me…

While I was writing up the code based on your advice, I stumbled upon a problem for which I haven’t found an answer yet .

So to retry the job in an exponentially delayed fashion I went ahead can changed django_gearman code as follows

class RetryJobPayload(object):

    def __init__(self,data,retry):

        self.retry = retry

        self.data = data

    def __unicode__(self):

        return "<RetryJobPayload %d %s>" % (self.retry, self.data)

   

class DjangoGearmanWorker(GearmanWorker):

    """

    gearman worker, automatically connecting to server and discovering

    available jobs

    """   

    data_encoder = PickleDataEncoder

  

    def on_job_execute(self, current_job):

        if isinstance(current_job.data,RetryJobPayload):

            current_job.data = current_job.data.data

       print "Job started"

        return super(DjangoGearmanWorker, self).on_job_execute(current_job)

 

    def on_job_exception(self, current_job, exc_info):

 

        retry = 0

        if isinstance(current_job.data, RetryJobPayload):

            retry = current_job.data.retry

            current_job.data = current_job.data.data

        else:

            print "original"

       

        client_retries = int(getattr(settings,'GEARMAN_CLIENT_RETRIES',1))

 

        if retry <= client_retries:

            retry += 1

            LOG.error("Gearman job %s failed. Attempting Retry after %d seconds."%(current_job.task,2*retry))

            time.sleep( 2 * retry )

 

            current_job.data = RetryJobPayload(current_job.data,retry)

            

            print "sending ... " + str(current_job.data)

           

            try:

                gm_client = DjangoGearmanClient()

                gm_client.submit_job(task = current_job.task,

                                     data = current_job.data,

                                     wait_until_complete = False,

                                     max_retries = 5)

            except Exception as e:

                LOG.exception(str(e))

       

        # Remember, u still have to exist since we have spawned a new job and not reusing the existing one

        print "Job failed"

        return super(DjangoGearmanWorker, self).on_job_exception(current_job, exc_info)

 

Now the problem I am facing is to encode the retry counter into the payload.

I haven’t really found out a way to do this since on the job execution I am sanitizing the payload to remove the retry counter since the method that needs to be called to do the task needs the data or payload of the type that it is excepting

e.g.

@gearman_job

def image_push(data,*args,**kwargs):

    assert isinstance(data,JPEGImage), "data should be instance of JPEGImage"

Effectively I can’t increment the counter , since a new job a spawned on every failure and a state cant be maintained.

Do you have any thoughts? Help?


Regards,
Cherian

Clint Byrum

unread,
Feb 3, 2012, 4:07:51 PM2/3/12
to gearman
Excerpts from Cherian Thomas's message of Fri Feb 03 08:04:47 -0800 2012:

> Hi Clint & Rhett,
>
> This might sound a bit stupid, but correct me…
>
> While I was writing up the code based on your advice, I stumbled upon a
> problem for which I haven’t found an answer yet .
>
> So to retry the job in an exponentially delayed fashion I went ahead can
> changed django_gearman <https://github.com/webright/django-gearman>code as


Glad to see you're getting closer to a solution.

When I was doing this, in PHP, I made an abstract class that would be extended
by any workers. In python the equivalent would be:

class BaseWorker(object):
def __init__(self, gearman_server):
self._job_name=None
self._server=gearman_server

def _register(self):
return self._server.add_function(self._job_name, self._run)

@classmethod
def _run(self, data):
real_payload = json_decode(data)
real_data=real_payload['data']
retry_counter=real_payload['retries']
result = self.run(real_data)
if result:
return result
else:
if retry_counter < RETRY_THRESHOLD:
real_payload['retries'] += 1
self._server.do_background(self._job_name, json_encode(real_payload))
else:
log_failure(real_data)
return result

I'm not sure if that will actually work in python, as its object model
is a bit different than PHP's, but you get the idea, right? Just have
workers that set self._job_name and implement run() and let thise
BaseWorker implement the retrying stuff.

Cherian Thomas

unread,
Feb 5, 2012, 9:18:59 AM2/5/12
to gea...@googlegroups.com

Thanks Clint. That was a deal of insight.

Thankfully django_gearman has a decorator pattern that made it easy for me to fix.

Now for the patch to the community…

I am still in for the server side solution. This worker based retry is a huge risk, especially if the you restart the worker during a build process and it’s in the middle of a sleep. 


Regards,
Cherian

Clint Byrum

unread,
Feb 5, 2012, 12:01:12 PM2/5/12
to gearman
Excerpts from Cherian Thomas's message of Sun Feb 05 06:18:59 -0800 2012:

> Thanks Clint. That was a deal of insight.
>
> Thankfully django_gearman has a decorator pattern that made it easy for me
> to fix.
>
> Now for the patch to the community…
>
> I am still in for the server side solution. This worker based retry is a
> huge risk, especially if the you restart the worker during a build process
> and it’s in the middle of a sleep.
>

Well if the sleep is done between the server giving you the job, and
you responding with a result, the server *will* retry on its own, as
that is a catastrophic job failure.

As far as the back-off/delay.. I just tried merging my old branch and
things have changed quite drastically in gearmand since I did the work,
so it will require some reworking.

Simpler solution is to just persist retries someplace else.

Reply all
Reply to author
Forward
0 new messages