Retry intervals

Kevin Taylor

Jan 20, 2015, 5:52:32 AM
to chronos-...@googlegroups.com

Is it possible to configure the retry operation?

It appears that you can specify the number of retries in a failure scenario but not the interval between each retry.


As an example, I kick off a job which polls for a given file before its dependents are scheduled.

If the script returns a non-zero condition, I would then like it to wait for a specified period before trying again.

If it reaches the maximum number of retries in that period, I would then treat this as a failure.

Is there any advice on this, or any intention to include this feature?

Regards, Kevin
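
One way to get the behaviour described above without scheduler support is to put the wait inside the job's own script. The following is a minimal sketch only: `wait_for_file` and its parameters are hypothetical names, not part of Chronos, and it assumes the job command is free to block while it polls.

```python
import os
import time


def wait_for_file(path, interval_seconds, max_attempts, sleep=time.sleep):
    """Poll for `path`, pausing between attempts.

    Returns True as soon as the path exists, or False once
    `max_attempts` checks have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        if os.path.exists(path):
            return True
        # Only wait if another attempt is still to come.
        if attempt < max_attempts:
            sleep(interval_seconds)
    return False
```

A wrapper script could call `wait_for_file(path, 900, 5)` and exit with a non-zero status when it returns False, so that the scheduler sees a single failure only after the whole polling window has passed.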



Kevin Taylor

Jan 20, 2015, 7:13:31 AM
to chronos-...@googlegroups.com
I have also noticed that the retry doesn't appear to work on subsequent runs.

I had a job with retries: 5. Yesterday it executed and failed, then retried 5 times as expected. The job is set to run every 24 hours.
Today it fired correctly after 24 hours but executed only once; it didn't retry 5 times.

Is this expected behaviour?

Brenden Matthews

Jan 20, 2015, 9:48:25 AM
to Kevin Taylor, chronos-...@googlegroups.com
That's correct. There is a delay parameter, here: https://github.com/mesos/chronos/blob/1ca23e8a3c22f0f27230f5ef8bf44aa8c46bb3ad/src/main/scala/com/airbnb/scheduler/config/SchedulerConfiguration.scala#L112-L114

It's a global config param, rather than per-job.

If a job did not succeed in the previous run, then it will not be retried (it will only make 1 attempt). This is to prevent flooding the cluster with failed jobs in cases where things go wrong, or when jobs are completely broken. They will also become disabled after enough attempts.
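
The rule described above can be paraphrased as a small sketch. This is not the Chronos implementation; `attempts_allowed` and `previous_run_failed` are hypothetical names used only to restate the policy in the message.

```python
def attempts_allowed(retries, previous_run_failed):
    """Total attempts a scheduled run may make under the described policy."""
    if previous_run_failed:
        # The last run did not succeed: make a single attempt, no retries.
        return 1
    # Otherwise: the initial attempt plus the configured retries.
    return 1 + retries
```

With `retries` set to 5, this gives 6 attempts on a run that follows no failure and 1 attempt on the run after a failure, which matches the behaviour reported for the 24-hour job.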

Kevin Taylor

Jan 20, 2015, 9:52:23 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
Thanks Brenden

Do you foresee any appetite for having a per-job retry config parameter?

Brenden Matthews

Jan 20, 2015, 9:53:59 AM
to Kevin Taylor, chronos-...@googlegroups.com
Pull requests are welcome. It should be fairly simple to add; just follow the existing code and look at some merged PRs.

Kevin Taylor

Jan 27, 2015, 11:02:26 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
I am struggling to make the failure retry interval work. By default the job retries twice and, as far as I can see, it should wait for the specified time between attempts.

ct:1422371866641:2:kev ChronosTask:kev FAILED 52 minutes ago 52 minutes ago
ct:1422371865629:1:kev ChronosTask:kev FAILED 52 minutes ago 52 minutes ago
ct:1422370932000:0:kev ChronosTask:kev FAILED 52 minutes ago 52 minutes ago

My interval is set to 900,000 ms (15 minutes). Looking at the ct: template output, the interval appears to be applied correctly, which is a bit baffling, because the jobs are not waiting for the specified time before retrying.
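
To compare the task IDs above with the configured interval, they can be split into their parts. This is a sketch under an assumption: judging only from the three samples, the fields look like `ct:<timestamp in epoch milliseconds>:<attempt>:<job name>`. That reading is inferred from the output, not taken from the Chronos documentation.

```python
from typing import NamedTuple


class TaskId(NamedTuple):
    due_ms: int    # assumed: epoch milliseconds embedded in the id
    attempt: int   # assumed: attempt counter, starting at 0
    job_name: str


def parse_task_id(task_id):
    # Split at most three times so a job name containing ":" stays intact.
    prefix, due_ms, attempt, job_name = task_id.split(":", 3)
    if prefix != "ct":
        raise ValueError(f"unexpected task id: {task_id!r}")
    return TaskId(int(due_ms), int(attempt), job_name)


ids = [
    "ct:1422370932000:0:kev",
    "ct:1422371865629:1:kev",
    "ct:1422371866641:2:kev",
]
tasks = [parse_task_id(task_id) for task_id in ids]
gaps_ms = [later.due_ms - earlier.due_ms for earlier, later in zip(tasks, tasks[1:])]
print(gaps_ms)  # [933629, 1012]
```

The differences between the embedded timestamps are 933,629 ms and 1,012 ms, which can be set against the configured 900,000 ms.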

In JobScheduler.scala it appears to call the enqueue method directly, which doesn't seem to have any delay mechanism in it, but I am not very familiar with the code base.

Have I made some incorrect assumptions?

Thanks for any help

Kevin

Brenden Matthews

Jan 27, 2015, 11:05:28 AM
to Kevin Taylor, chronos-...@googlegroups.com
It's quite possible that it's busted. Can you give me any more logging context? And the JSON of the job (you can grab it from the `/scheduler/jobs` endpoint)? I'll check it out later today.

Kevin Taylor

Jan 27, 2015, 11:21:27 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
This is the JSON. It is very simple: it just calls a non-existent command, with no other overrides:

[{"name":"kev","command":"a","shell":true,"epsilon":"PT30M","executor":"","executorFlags":"","retries":2,"owner":"a...@a.com","async":false,"successCount":0,"errorCount":1,"lastSuccess":"","lastError":"2015-01-27T16:11:04.637Z","cpus":0.1,"disk":256.0,"mem":128.0,"disabled":false,"softError":false,"errorsSinceLastSuccess":1,"uris":[],"environmentVariables":[],"arguments":[],"highPriority":false,"runAsUser":"root","schedule":"R/2015-01-28T16:10:23.000Z/PT24H","scheduleTimeZone":""}]
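
The `schedule` and `epsilon` values in the JSON above use ISO 8601 notation: `R/<start>/<period>` for a repeating interval (no count after `R` meaning no limit on repetitions) and `PT30M` for a duration. A minimal sketch for reading them follows; `parse_schedule` and `parse_duration` are hypothetical helpers, and the duration parser deliberately handles only hour, minute and second components.

```python
import re
from datetime import datetime, timedelta, timezone

# Handles only the hour/minute/second durations seen in the job above.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text):
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def parse_schedule(schedule):
    """Split 'R[n]/<start>/<period>' into (repetitions, start, period)."""
    repeat, start, period = schedule.split("/")
    if not repeat.startswith("R"):
        raise ValueError(f"unexpected repeat field: {repeat!r}")
    # A bare "R" carries no count; represent that as None.
    repetitions = int(repeat[1:]) if len(repeat) > 1 else None
    start_time = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S.%fZ")
    return repetitions, start_time.replace(tzinfo=timezone.utc), parse_duration(period)
```

For the job above, `parse_schedule("R/2015-01-28T16:10:23.000Z/PT24H")` yields no repetition limit, a start of 2015-01-28 16:10:23 UTC and a period of 24 hours.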


This is a snippet from the log covering the retries, which shows that it retries immediately on failure.
There are a fair number of other info messages in the log, so I'm not sure whether this will be enough. Please let me know if you want the full trace.


airbnbchronos_1 | [2015-01-27 16:06:56,924] WARN Starting [class com.airbnb.notification.MailClient] notification client. (com.airbnb.scheduler.config.MainModule:104)
airbnbchronos_1 | [2015-01-27 16:06:57,117] WARN No SSL support configured. (mesosphere.chaos.http.HttpModule:62)
airbnbchronos_1 | [2015-01-27 16:10:58,913] WARN Adding vertex:kev (com.airbnb.scheduler.graph.JobGraph:65)
airbnbchronos_1 | [2015-01-27 16:10:58,914] WARN Current number of vertices:1 (com.airbnb.scheduler.graph.JobGraph:72)
airbnbchronos_1 | [2015-01-27 16:11:02,247] WARN Ignoring offered resource: RANGES (com.airbnb.scheduler.mesos.MesosJobFramework:260)
airbnbchronos_1 | [2015-01-27 16:11:02,622] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
airbnbchronos_1 | [2015-01-27 16:11:02,622] WARN Retrying job: kev, attempt: 0 (com.airbnb.scheduler.jobs.JobScheduler:392)
airbnbchronos_1 | [2015-01-27 16:11:03,220] WARN Ignoring offered resource: RANGES (com.airbnb.scheduler.mesos.MesosJobFramework:260)
airbnbchronos_1 | [2015-01-27 16:11:03,628] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
airbnbchronos_1 | [2015-01-27 16:11:03,629] WARN Retrying job: kev, attempt: 1 (com.airbnb.scheduler.jobs.JobScheduler:392)
airbnbchronos_1 | [2015-01-27 16:11:04,220] WARN Ignoring offered resource: RANGES (com.airbnb.scheduler.mesos.MesosJobFramework:260)
airbnbchronos_1 | [2015-01-27 16:11:04,637] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
airbnbchronos_1 | [2015-01-27 16:11:04,662] WARN Job failed beyond retries! (com.airbnb.scheduler.jobs.JobScheduler:438)
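
The spacing between failures can be read off the log programmatically. The sketch below assumes the line layout shown above (a bracketed timestamp followed by `WARN Task of job: <name> failed.`); `failure_gaps_ms` is a hypothetical helper, not a Chronos tool.

```python
import re
from datetime import datetime, timedelta

# Matches the "Task of job: <name> failed." lines and captures timestamp and job.
FAILURE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] WARN Task of job: (\S+) failed\."
)


def failure_gaps_ms(log_text, job):
    """Milliseconds between consecutive failures of `job` in `log_text`."""
    times = [
        datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        for stamp, name in FAILURE.findall(log_text)
        if name == job
    ]
    return [
        (later - earlier) // timedelta(milliseconds=1)
        for earlier, later in zip(times, times[1:])
    ]


SAMPLE = """\
airbnbchronos_1 | [2015-01-27 16:11:02,622] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
airbnbchronos_1 | [2015-01-27 16:11:03,628] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
airbnbchronos_1 | [2015-01-27 16:11:04,637] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
"""
print(failure_gaps_ms(SAMPLE, "kev"))  # [1006, 1009]
```

The failures are roughly one second apart, far short of the configured 900,000 ms.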


Thanks, Kevin

Kevin Taylor

Jan 29, 2015, 4:22:32 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
Hi Brenden (anyone?)
Did you have a chance to look at this?
It is a key element of what we are trying to do with the scheduler

Thanks
Kevin

Brenden Matthews

Jan 29, 2015, 9:28:58 AM
to Kevin Taylor, chronos-...@googlegroups.com
Whoops, I forgot to reply.

Yes, this was a bug in Chronos. I've patched it here: https://github.com/mesos/chronos/pull/352

Kevin Taylor

Jan 29, 2015, 11:57:07 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
Nice. Thanks. I'll give this a run out tomorrow.
Kevin

Kevin Taylor

Feb 2, 2015, 5:27:15 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
I tested the latest Maven head build and the fix appears to work, although I am now getting an Akka error on startup of Chronos:

[ERROR] [01/30/2015 16:54:46.184] [chronos-actors-akka.actor.default-dispatcher-2] [akka://chronos-actors/user/$a] null
akka.actor.ActorInitializationException: exception during creation
        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

Also, if I do a yum install, I get this version: 2.3.1-0.1.20150122195120

I am assuming from the version format that this build is from the 22nd of January, which wouldn't include this fix.
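
That reading of the version string can be checked with a few lines. This sketch rests on the same assumption as the message, namely that the trailing 14 digits are a `YYYYMMDDHHMMSS` build timestamp; `build_time` is a hypothetical helper.

```python
from datetime import datetime


def build_time(version):
    """Parse the trailing digits of the package version as a timestamp."""
    # Assumption: the last dot-separated field is YYYYMMDDHHMMSS.
    stamp = version.rsplit(".", 1)[-1]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


print(build_time("2.3.1-0.1.20150122195120"))  # 2015-01-22 19:51:20
```

Under that assumption the package was built on 22 January 2015, a week before the patch was posted on 29 January.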

Can you please tell me what your build policy is in releasing to the mesosphere repo and is there a link which I can use to follow this?

Thanks, Kevin

Max Audet

Feb 2, 2015, 11:57:41 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
I'm wondering the same here, as we also experience retry issues.
Thanks!

Brenden Matthews

Feb 2, 2015, 4:37:42 PM
to Kevin Taylor, chronos-...@googlegroups.com
Do you have any more context for that error? There should be a "Caused by: ..." thing below the line you printed.

Kevin Taylor

Feb 3, 2015, 5:08:34 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
Hi Brenden. Full Exception trace:

[ERROR] [01/30/2015 16:54:46.184] [chronos-actors-akka.actor.default-dispatcher-2] [akka://chronos-actors/user/$a] null
akka.actor.ActorInitializationException: exception during creation
        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
        at akka.actor.ActorCell.create(ActorCell.scala:596)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
        at akka.util.Reflect$.instantiate(Reflect.scala:66)
        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
        at akka.actor.Props.newActor(Props.scala:252)
        at akka.actor.ActorCell.newActor(ActorCell.scala:552)
        at akka.actor.ActorCell.create(ActorCell.scala:578)
        ... 9 more
Caused by: java.lang.NumberFormatException: null
        at java.lang.Integer.parseInt(Integer.java:443)
        at java.lang.Integer.parseInt(Integer.java:514)
        at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:241)
        at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
        at org.apache.mesos.chronos.notification.MailClient.<init>(MailClient.scala:22)
        ... 18 more

Kevin Taylor

Feb 3, 2015, 8:46:54 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
Hi Brenden

It looks like the line `val mailPort = mailPortStr.toInt` (line 22 in MailClient.scala) was moved and now runs before `mailPortStr` is initialised, which would explain the NumberFormatException: null in the trace.

I did a quick local test, moving the line back to where `mailPortStr` is initialised, and the problem went away.

If you want me to raise an issue on this and put in a pull request, I am happy to do this, but just wanted to check first for any rationale you may have had for shifting it

Thanks for your support. Kevin

Brenden Matthews

Feb 3, 2015, 10:36:05 AM
to Kevin Taylor, chronos-...@googlegroups.com
I actually fixed that in this PR: https://github.com/mesos/chronos/pull/361

I'm going to go ahead and merge it.

Kevin Taylor

Feb 3, 2015, 10:56:47 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
Even better. Thanks

Would you be so kind as to follow up on my other question, which was about your yum repo policy? Basically, I would like to know how frequently you release and whether there is a link to the actual releases into the repo.

This information doesn't appear to be on GitHub, so I assume you build and release through another channel (a link would be good if you have one).

Thanks

Kevin

Brenden Matthews

Feb 3, 2015, 11:04:59 AM
to Kevin Taylor, chronos-...@googlegroups.com
It's an excellent question, and I don't have an answer for you on that. I'll have to figure it out and get back to you.