Retry intervals

Kevin Taylor

Jan 20, 2015, 5:52:32 AM

to chronos-...@googlegroups.com

Is it possible to configure the retry operation?

It appears that you can specify the number of retries in a failure scenario but not the interval between each retry.

As an example, I kick off a job which polls for a given file before its dependents are scheduled.

If the script returns a non-zero exit status, I would like it to wait for a specified period before trying again.

If it reaches the maximum number of retries, I would then treat this as a failure.

Any advice on this, or intention of including this feature?

Regards, Kevin
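
Until a per-job interval exists, one possible workaround is to move the waiting into the job's own command. The sketch below is a hypothetical wrapper, not a Chronos feature: `wait_for_file`, its parameters and their defaults are illustrative assumptions. It checks for a file, sleeps between checks, and reports failure only once all attempts are used up.

```python
import os
import time


def wait_for_file(path, max_attempts=5, interval_seconds=900.0, sleep=time.sleep):
    """Return True as soon as `path` exists, checking up to `max_attempts` times.

    Sleeps `interval_seconds` between checks (not after the last one).
    `sleep` is injectable so the function can be exercised without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        if os.path.exists(path):
            return True
        if attempt < max_attempts:
            sleep(interval_seconds)
    return False


def main(argv):
    """Map the result to an exit status: 0 if the file appeared, 1 otherwise."""
    return 0 if wait_for_file(argv[1]) else 1
```

A script would finish with `sys.exit(main(sys.argv))`, so the scheduler sees one non-zero exit only after the last check has failed.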

Kevin Taylor

Jan 20, 2015, 7:13:31 AM

to chronos-...@googlegroups.com

I have also noticed that retries don't appear to work on subsequent runs.

I have a job with retries: 5. Yesterday it executed and failed, then retried 5 times as expected. The job is set to run every 24 hours.

Today it fired correctly after 24 hours but made only a single attempt; it didn't retry 5 times.

Is this expected behaviour?

Brenden Matthews

Jan 20, 2015, 9:48:25 AM

to Kevin Taylor, chronos-...@googlegroups.com

That's correct. There is a delay parameter, here: https://github.com/mesos/chronos/blob/1ca23e8a3c22f0f27230f5ef8bf44aa8c46bb3ad/src/main/scala/com/airbnb/scheduler/config/SchedulerConfiguration.scala#L112-L114

It's a global config param, rather than per-job.

If a job did not succeed in the previous run, then it will not be retried (it will only make 1 attempt). This is to prevent flooding the cluster with failed jobs in cases where things go wrong, or when jobs are completely broken. They will also become disabled after enough attempts.
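
Read literally, the rule above can be modelled as follows. This is only an illustration of the behaviour described in this reply, not Chronos code: the function name and its arguments are made up, and the point at which a job becomes disabled is left out because no figure is given here.

```python
def attempts_for_run(retries, previous_run_failed):
    """Number of attempts a scheduled run gets under the rule described above.

    A run that does not follow a failed run gets the initial attempt plus
    `retries` retries; a run that follows a failed run gets one attempt only.
    """
    if previous_run_failed:
        return 1
    return 1 + retries


# The job from the earlier message: retries = 5, scheduled every 24 hours.
day_one = attempts_for_run(5, previous_run_failed=False)  # initial attempt + 5 retries
day_two = attempts_for_run(5, previous_run_failed=True)   # single attempt
```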

Kevin Taylor

Jan 20, 2015, 9:52:23 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

Thanks Brenden

Do you foresee any appetite for having a per-job retry config parameter?

Brenden Matthews

Jan 20, 2015, 9:53:59 AM

to Kevin Taylor, chronos-...@googlegroups.com

Pull requests are welcome. It should be fairly simple to add, just follow the existing code and look at some merged PRs.

Kevin Taylor

Jan 27, 2015, 11:02:26 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

I am struggling to make the failure retry interval work. By default the job retries twice and, as far as I can see, it should wait for the specified time between attempts.

ct:1422371866641:2:kev ChronosTask:kev FAILED 52 minutes ago 52 minutes ago

ct:1422371865629:1:kev ChronosTask:kev FAILED 52 minutes ago 52 minutes ago

ct:1422370932000:0:kev ChronosTask:kev FAILED 52 minutes ago 52 minutes ago

My interval is set to 900,000 ms (15 minutes). Looking at the ct: template output, the interval appears to be applied correctly, which is a bit baffling, because the jobs are not waiting for the specified time before retrying.
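
A quick way to check what the task IDs above encode, assuming the layout is `ct:<time in epoch ms>:<attempt>:<job name>`. That layout is inferred from the three IDs listed and is not confirmed anywhere in this thread; `parse_task_id` is a throwaway helper.

```python
from collections import namedtuple

TaskId = namedtuple("TaskId", ["due_ms", "attempt", "job_name"])


def parse_task_id(task_id):
    """Split an ID such as 'ct:1422370932000:0:kev' into its parts.

    Assumes the layout ct:<epoch millis>:<attempt>:<job name>.
    """
    prefix, due_ms, attempt, job_name = task_id.split(":", 3)
    if prefix != "ct":
        raise ValueError("not a ct: task id: %r" % task_id)
    return TaskId(int(due_ms), int(attempt), job_name)


ids = [
    "ct:1422370932000:0:kev",
    "ct:1422371865629:1:kev",
    "ct:1422371866641:2:kev",
]
tasks = sorted((parse_task_id(i) for i in ids), key=lambda t: t.attempt)

# Gap in milliseconds between the timestamps of consecutive attempts.
gaps_ms = [b.due_ms - a.due_ms for a, b in zip(tasks, tasks[1:])]
print(gaps_ms)  # [933629, 1012]
```

Under that assumption, the first retry carries a timestamp about 15.5 minutes after the original task, close to the configured 900,000 ms, while the second retry is only about one second after the first.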

In JobScheduler.scala it appears to call the enqueue method directly, which doesn't seem to have any delay mechanism in it, but I am not overly familiar with the code base.

Have I made some incorrect assumptions?

Thanks for any help

Kevin

Brenden Matthews

Jan 27, 2015, 11:05:28 AM

to Kevin Taylor, chronos-...@googlegroups.com

It's quite possible that it's busted. Can you give me any more logging context? And the JSON of the job (you can grab it from the `/scheduler/jobs` endpoint)? I'll check it out later today.

Kevin Taylor

Jan 27, 2015, 11:21:27 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

This is the JSON. It is very simple: it just calls a non-existent command, with no other overrides:

[{"name":"kev","command":"a","shell":true,"epsilon":"PT30M","executor":"","executorFlags":"","retries":2,"owner":"a...@a.com","async":false,"successCount":0,"errorCount":1,"lastSuccess":"","lastError":"2015-01-27T16:11:04.637Z","cpus":0.1,"disk":256.0,"mem":128.0,"disabled":false,"softError":false,"errorsSinceLastSuccess":1,"uris":[],"environmentVariables":[],"arguments":[],"highPriority":false,"runAsUser":"root","schedule":"R/2015-01-28T16:10:23.000Z/PT24H","scheduleTimeZone":""}]

This is a snippet from the log covering the retries. It shows that the job is simply retried immediately on failure.

There are a fair number of other info messages in the log, so I am not sure if this will be enough. Please let me know if you want the full trace.

airbnbchronos_1 | [2015-01-27 16:06:56,924] WARN Starting [class com.airbnb.notification.MailClient] notification client. (com.airbnb.scheduler.config.MainModule:104)
airbnbchronos_1 | [2015-01-27 16:06:57,117] WARN No SSL support configured. (mesosphere.chaos.http.HttpModule:62)
airbnbchronos_1 | [2015-01-27 16:10:58,913] WARN Adding vertex:kev (com.airbnb.scheduler.graph.JobGraph:65)
airbnbchronos_1 | [2015-01-27 16:10:58,914] WARN Current number of vertices:1 (com.airbnb.scheduler.graph.JobGraph:72)
airbnbchronos_1 | [2015-01-27 16:11:02,247] WARN Ignoring offered resource: RANGES (com.airbnb.scheduler.mesos.MesosJobFramework:260)
airbnbchronos_1 | [2015-01-27 16:11:02,622] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
airbnbchronos_1 | [2015-01-27 16:11:02,622] WARN Retrying job: kev, attempt: 0 (com.airbnb.scheduler.jobs.JobScheduler:392)
airbnbchronos_1 | [2015-01-27 16:11:03,220] WARN Ignoring offered resource: RANGES (com.airbnb.scheduler.mesos.MesosJobFramework:260)
airbnbchronos_1 | [2015-01-27 16:11:03,628] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
airbnbchronos_1 | [2015-01-27 16:11:03,629] WARN Retrying job: kev, attempt: 1 (com.airbnb.scheduler.jobs.JobScheduler:392)
airbnbchronos_1 | [2015-01-27 16:11:04,220] WARN Ignoring offered resource: RANGES (com.airbnb.scheduler.mesos.MesosJobFramework:260)
airbnbchronos_1 | [2015-01-27 16:11:04,637] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
airbnbchronos_1 | [2015-01-27 16:11:04,662] WARN Job failed beyond retries! (com.airbnb.scheduler.jobs.JobScheduler:438)
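
To put a number on "immediately", the gaps between the failure lines can be computed from their timestamps. A small sketch, assuming the log format shown above (`[YYYY-MM-DD HH:MM:SS,mmm]` followed by the level and message, with the `airbnbchronos_1 | ` container prefix stripped); the regular expression and variable names are illustrative only.

```python
import re
from datetime import datetime

LOG = """\
[2015-01-27 16:11:02,622] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
[2015-01-27 16:11:02,622] WARN Retrying job: kev, attempt: 0 (com.airbnb.scheduler.jobs.JobScheduler:392)
[2015-01-27 16:11:03,628] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
[2015-01-27 16:11:03,629] WARN Retrying job: kev, attempt: 1 (com.airbnb.scheduler.jobs.JobScheduler:392)
[2015-01-27 16:11:04,637] WARN Task of job: kev failed. (com.airbnb.scheduler.jobs.JobScheduler:374)
"""

# Matches only the "Task of job: <name> failed." lines and captures the timestamp.
FAILED = re.compile(r"^\[(?P<ts>[^\]]+)\] \w+ Task of job: \S+ failed\.")

failures = []
for line in LOG.splitlines():
    match = FAILED.match(line)
    if match:
        failures.append(datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f"))

# Seconds between consecutive failures.
gaps_seconds = [(b - a).total_seconds() for a, b in zip(failures, failures[1:])]
print(gaps_seconds)  # [1.006, 1.009]
```

Both gaps come out at about one second, far short of the 900,000 ms setting.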

Thanks, Kevin

Kevin Taylor

Jan 29, 2015, 4:22:32 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

Hi Brenden (anyone?)

Did you have a chance to look at this?

It is a key element of what we are trying to do with the scheduler.

Thanks

Kevin

Brenden Matthews

Jan 29, 2015, 9:28:58 AM

to Kevin Taylor, chronos-...@googlegroups.com

Woops, I forgot to reply.

Yes, this was a bug in Chronos. I've patched it here: https://github.com/mesos/chronos/pull/352

Kevin Taylor

Jan 29, 2015, 11:57:07 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

Nice. Thanks. I'll give this a run out tomorrow.

Kevin

Kevin Taylor

Feb 2, 2015, 5:27:15 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

I tested the latest Maven head build and the retry delay appears to work, although I am now getting an Akka error when Chronos starts up:

[ERROR] [01/30/2015 16:54:46.184] [chronos-actors-akka.actor.default-dispatcher-2] [akka://chronos-actors/user/$a] null
akka.actor.ActorInitializationException: exception during creation
    at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

Also, if I do a yum install, I get this version: 2.3.1-0.1.20150122195120

I am assuming from the version format that this build is from 22 January, which wouldn't include this fix.

Can you please tell me what your policy is for building and releasing to the mesosphere repo, and whether there is a link I can use to follow releases?

Thanks, Kevin

Max Audet

Feb 2, 2015, 11:57:41 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

Wondering the same here, as we are also experiencing retry issues.

Thanks!

Brenden Matthews

Feb 2, 2015, 4:37:42 PM

to Kevin Taylor, chronos-...@googlegroups.com

Do you have any more context for that error? There should be a "Caused by: ..." thing below the line you printed.

Kevin Taylor

Feb 3, 2015, 5:08:34 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

Hi Brenden. Full Exception trace:

[ERROR] [01/30/2015 16:54:46.184] [chronos-actors-akka.actor.default-dispatcher-2] [akka://chronos-actors/user/$a] null
akka.actor.ActorInitializationException: exception during creation
    at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
    at akka.actor.ActorCell.create(ActorCell.scala:596)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
    at akka.util.Reflect$.instantiate(Reflect.scala:66)
    at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
    at akka.actor.Props.newActor(Props.scala:252)
    at akka.actor.ActorCell.newActor(ActorCell.scala:552)
    at akka.actor.ActorCell.create(ActorCell.scala:578)
    ... 9 more
Caused by: java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:443)
    at java.lang.Integer.parseInt(Integer.java:514)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:241)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
    at org.apache.mesos.chronos.notification.MailClient.<init>(MailClient.scala:22)
    ... 18 more

Kevin Taylor

Feb 3, 2015, 8:46:54 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

Hi Brenden

It looks like the line `val mailPort = mailPortStr.toInt` (line 22 in MailClient.scala) was moved, so mailPortStr appears to be uninitialised at the point where it is used.

I did a quick local test and put the line back where mailPortStr is initialised, and the problem went away.

If you want me to raise an issue on this and put in a pull request, I am happy to do this, but just wanted to check first for any rationale you may have had for shifting it

Thanks for your support. Kevin

Brenden Matthews

Feb 3, 2015, 10:36:05 AM

to Kevin Taylor, chronos-...@googlegroups.com

I actually fixed that in this PR: https://github.com/mesos/chronos/pull/361

I'm going to go ahead and merge it.

Kevin Taylor

Feb 3, 2015, 10:56:47 AM

to chronos-...@googlegroups.com, kevin.ta...@gmail.com

Even better. Thanks

Would you be so kind as to follow up on my other question, regarding your yum repo policy? Basically, I would like to know how frequently builds are released and whether there is a link to the actual releases into the repo.

This information doesn't appear to be on GitHub, so I assume you build and release through another channel (a link would be good if you have one).

Thanks

Kevin

Brenden Matthews

Feb 3, 2015, 11:04:59 AM

to Kevin Taylor, chronos-...@googlegroups.com

It's an excellent question, and I don't have an answer for you on that. I'll have to figure it out and get back to you.
