After recently updating to the SNAPSHOT, to solve the delay-after-job issue that Alex reported, I've noticed that checkpointing no longer works as it used to. If a large job with checkpointing fails partway through, the checkpointed outputs contain _SUCCESS_JOB files but no _SUCCESS files. When the job is restarted, checkpointing does not look for _SUCCESS_JOB; instead it restarts from the beginning and discards the intermediate checkpoints from the previous run. We previously relied on checkpointing to retry large jobs after a failure, because our cluster has plenty of transient errors, but that no longer works.
Anyway, that is what the issue looks like to me, but I may have missed something.
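
To make the symptom concrete, here is a rough sketch of the check I suspect is happening versus the one I expected. This is only an illustration on a local directory: the function names and the directory layout are hypothetical, and none of it is taken from the actual checkpointing code. Only the two marker file names come from what I see on disk.

```python
import tempfile
from pathlib import Path

# Marker names as observed in the checkpoint output; everything else is hypothetical.
SUCCESS = "_SUCCESS"
SUCCESS_JOB = "_SUCCESS_JOB"


def is_reusable_strict(checkpoint_dir: Path) -> bool:
    """Suspected current behaviour: only a _SUCCESS marker counts."""
    return (checkpoint_dir / SUCCESS).exists()


def is_reusable_lenient(checkpoint_dir: Path) -> bool:
    """Expected behaviour: either marker makes the checkpoint reusable."""
    return any((checkpoint_dir / name).exists() for name in (SUCCESS, SUCCESS_JOB))


def demo():
    # Simulate a checkpoint left behind by a job that failed partway through:
    # the directory holds _SUCCESS_JOB but no _SUCCESS.
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "checkpoint-1"
        checkpoint.mkdir()
        (checkpoint / SUCCESS_JOB).touch()
        return is_reusable_strict(checkpoint), is_reusable_lenient(checkpoint)


if __name__ == "__main__":
    strict, lenient = demo()
    print(f"strict check reusable: {strict}, lenient check reusable: {lenient}")
```

With the strict check the leftover checkpoint is ignored, which would match the restart-from-scratch behaviour I am seeing; with the lenient check it would be picked up again.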
Thanks,
Joe