Scheduling jobs in the past

Vikram Oberoi

Sep 22, 2010, 8:27:26 PM
to azkab...@googlegroups.com
Hey folks,

I'm going to introduce the notion of scheduling jobs in the past in Azkaban. The change is actually easy to make, but it's confusing and potentially contentious. I'd like some feedback on: whether this is a good idea, whether this should be in Azkaban, and whether it should even be implemented the way I'm proposing.

What does "scheduling in the past" mean? Let's take a simple example:

It's September 20th. I've implemented a Pig job that grabs 7-day trailing uniques for product X, and I want it to run daily at midnight. However, I want to grab this data from August 1st onward. I want to be able to tell Azkaban to schedule this job to run on August 1st and run it daily. The behavior I expect is that Azkaban will immediately begin running the job and schedule it to run every day on a recurring basis. Until the job catches up to the current day, it'll keep executing immediately.

The sole reason I want this is so that the correct date gets interpolated into my logfile path strings. That's it. I don't want to write a separate script to execute a job flow I've already defined for Azkaban.
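
To illustrate the date interpolation, here's a rough Python sketch. It is not Azkaban code: the path template and the `log_path_for` helper are made-up names, and the real path layout depends on how your logs are stored. The point is only that the path is built from the run's scheduled time rather than from the wall clock.

```python
from datetime import datetime

# Hypothetical path template; the real layout depends on your log storage.
LOG_PATH_TEMPLATE = "/logs/product_x/{year:04d}/{month:02d}/{day:02d}"


def log_path_for(scheduled_time):
    """Build the input path from the run's scheduled time, not the wall clock."""
    return LOG_PATH_TEMPLATE.format(
        year=scheduled_time.year,
        month=scheduled_time.month,
        day=scheduled_time.day,
    )


# A run scheduled for August 1st reads August 1st's logs,
# even if it actually executes on September 20th.
path = log_path_for(datetime(2010, 8, 1))
```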

How will you implement this?

My master branch for Azkaban already tracks each job's scheduled time in a way that supports this. The rule is that every consecutive job's scheduled time is the previous job's scheduled time + the period.
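
To make the rule concrete, here's a minimal Python sketch of it. This is not the Azkaban implementation: `catch_up_runs` and its parameters are hypothetical names, and the "now" timestamp is an assumption taken from the September 20th example. It only shows how "previous scheduled time + period" produces a backlog of runs that are due immediately when the first scheduled time is in the past.

```python
from datetime import datetime, timedelta


def catch_up_runs(first_scheduled, period, now):
    """Yield the scheduled time of every run that is already due at `now`.

    Each run's scheduled time is the previous run's scheduled time plus the
    period, so a start date in the past yields a backlog of due runs.
    """
    scheduled = first_scheduled
    while scheduled <= now:
        yield scheduled
        scheduled += period


# Assumed scenario: a daily job first scheduled for midnight on August 1st,
# submitted at noon on September 20th.
now = datetime(2010, 9, 20, 12, 0)
period = timedelta(days=1)
due = list(catch_up_runs(datetime(2010, 8, 1), period, now))

# Once the backlog is done, the next run waits for its scheduled time.
next_run = due[-1] + period
```

In this scenario every run in `due` would execute right away with its own scheduled time, and only `next_run` would wait for the clock.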

It turns out that there's nothing in the Azkaban codebase that prevents jobs from being scheduled in the past. We just don't expose that functionality on the frontend. All I'm going to do is expose a way to set the date you schedule a job on the frontend.

Do you have any objections to sticking this in the UI?

Vikram

Eric Tschetter

Sep 23, 2010, 1:23:55 PM
to azkab...@googlegroups.com
Backfill of processing is definitely something that should be
automated/automatable. This seems like a good way to make it happen. I
have slight reservations in that this mechanism will end up as something
people program against. That's not necessarily a bad thing for anyone
making new jobs now, but, for example, old jobs written for a different
model won't necessarily be able to take advantage of it. Then again, I'm
not sure that's a problem we need to solve either...

Given how simple just scheduling things in the past is, I think I'd be
behind providing that functionality as a quick way to get back-fill
capabilities. But, I'm not sure that it's the way things will/should be
done in some future world where Azkaban is perfect and world hunger will be
solved.

So, I guess that despite some objections, for lack of a "better" solution,
I'm +1 for getting it out there and then we can let whatever pains are faced
inform decisions on a better solution (if, in fact, there is one).

--Eric


On 9/22/10 5:27 PM, "Vikram Oberoi" <vob...@gmail.com> wrote:

> Hey folks,
>
> I'm going to introduce the notion of scheduling jobs in the past in Azkaban.
> The change is actually easy to make, but it's confusing and potentially
> contentious. I'd like some feedback on: whether this is a good idea, whether
> this should be in Azkaban, and whether it should even be implemented the way
> I'm proposing.
>
> *What does "scheduling in the past" mean?* Let's take a simple example:
>
> It's September 20th. I've implemented a Pig job that grabs 7-day trailing
> uniques for product X, and I want it to run daily at midnight. However, I
> want to grab this data from August 1st onward. I want to be able to tell
> Azkaban to schedule this job to run on August 1st and run it daily. My
> expected behavior is that Azkaban will immediately begin running the job and
> schedule it to run every day on recurring basis. Until the job catches up to
> the current day, it'll keep executing immediately.
>
> The sole reason I want to do this is because I want the correct date to
> interpolate in my logfile path strings. That's it. I don't want to write a
> separate script to execute a job flow I've already defined for Azkaban.
>
> *How will you implement this?*
>
> My master branch for Azkaban already has a change to track the scheduled
> time in such a way. The rule for this scheduled time is that *every
> consecutive job's scheduled time is the previous job's scheduled time + the
> period*.

Vikram Oberoi

Sep 23, 2010, 9:49:32 PM
to azkab...@googlegroups.com
Great!


It's been merged into http://github.com/voberoi/azkaban/tree/master, which contains code merged from all the other branches in my fork.

Cheers,
Vikram