Checkpointing broken in SNAPSHOT (_SUCCESS_JOB not checked)


Joseph Beynon

May 13, 2014, 1:14:56 PM
to scoob...@googlegroups.com
After updating to a recent SNAPSHOT to fix the delay-after-job issue that Alex reported, I've noticed that checkpointing no longer works the way it used to. If a large job with checkpointing fails midway, the checkpointed outputs have _SUCCESS_JOB files but not _SUCCESS files. When the job is restarted, checkpointing does not look for _SUCCESS_JOB, so it restarts from the beginning and discards the intermediate checkpoints from the previous run. We have been relying on checkpointing to retry large jobs after a failure, because our cluster has plenty of transient errors, but that no longer works.
That is what the issue looks like to me, but I may have missed something.
Thanks,
Joe

Eric Torreborre

May 16, 2014, 8:29:52 AM
to scoob...@googlegroups.com
Sorry Joseph,

I don't have time to check this right now but can you please try the latest released version, 0.8.4? There were some breaks in the recent 0.9.0-SNAPSHOTs, including how SUCCESS files are being moved.
With 0.8.4, all our checkpoint tests are currently passing, so maybe the problem lies with the SNAPSHOT version you have.

If there is still an issue I'll have a look at it.

Thanks,

Eric.

Joseph Beynon

May 17, 2014, 1:50:55 PM
to scoob...@googlegroups.com
It looks like 0.8.4 doesn't solve the problem. The _SUCCESS_JOB change was made between 0.8.3 and 0.8.4. The problem is not in basic checkpointing itself: if a chain fails partway through, output that was completed mid-chain does not get picked up. Suppose one call to persist creates output paths "a" and "b", both with checkpointing enabled, and "a" completes successfully, so we now have "a/_SUCCESS_JOB" on HDFS. If "b" then fails, "a/_SUCCESS_JOB" never gets changed to "a/_SUCCESS". When we retry the chain, checkpointing looks for "a/_SUCCESS", doesn't find it, and re-runs the entire chain instead of just doing "b".
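
To make the failure mode concrete, here is a toy model of the marker check described above. It is not Scoobi code: the function names and the `accept_job_marker` flag are hypothetical, a local temporary directory stands in for HDFS, and only the marker file names and the "a"/"b" paths come from the description. Whether the library's eventual fix works like the lenient variant is not something this sketch claims.

```python
import os
import tempfile

SUCCESS = "_SUCCESS"          # marker written once the whole chain finishes
SUCCESS_JOB = "_SUCCESS_JOB"  # marker written when a single job finishes


def is_checkpoint_complete(path, accept_job_marker=False):
    """Hypothetical check: does this output directory carry a completion marker?"""
    markers = [SUCCESS, SUCCESS_JOB] if accept_job_marker else [SUCCESS]
    return any(os.path.exists(os.path.join(path, m)) for m in markers)


def outputs_to_rerun(paths, accept_job_marker=False):
    """Return the outputs a retry would recompute under the given check."""
    return [p for p in paths if not is_checkpoint_complete(p, accept_job_marker)]


# Reproduce the state after the failed run: "a" finished, "b" did not.
root = tempfile.mkdtemp()
a = os.path.join(root, "a")
b = os.path.join(root, "b")
os.makedirs(a)
os.makedirs(b)
open(os.path.join(a, SUCCESS_JOB), "w").close()  # only a/_SUCCESS_JOB exists

# Looking only for _SUCCESS treats "a" as incomplete, so everything is redone.
strict = [os.path.basename(p) for p in outputs_to_rerun([a, b])]
# Also accepting _SUCCESS_JOB leaves only "b" to recompute.
lenient = [os.path.basename(p) for p in outputs_to_rerun([a, b], accept_job_marker=True)]

print(strict)   # ['a', 'b']
print(lenient)  # ['b']
```

The strict result matches the behaviour we are seeing on retry; the lenient result is what we expected from checkpointing.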

Eric Torreborre

May 20, 2014, 3:00:22 AM
to scoob...@googlegroups.com
Hi Joseph,

Can you please try the latest 0.9.0-SNAPSHOT and tell me if it works better for you? You'll need to wait a bit for the new jars to be republished by Jenkins; the BuildInfo commit is df53463.

If that's the case I'll publish an official version tomorrow.

Thanks,

Eric.

Joseph Beynon

Jun 6, 2014, 1:54:09 PM
to scoob...@googlegroups.com
Sorry it took a while to get to testing this. It looks like things are working better in this version. There was an issue with the job naming, where it said "step 0 of x" instead of step 1, but it ran well.

Eric Torreborre

Jun 8, 2014, 10:34:50 PM
to scoob...@googlegroups.com
Good to know. The step numbering should be fine in the most recent snapshot.