I have been looking into the batch job failures on the Mifos 1.7.x deploy.
http://mifosforge.jira.com/browse/MIFOS-3840
http://mifosforge.jira.com/browse/MIFOS-3841
It seems that the current batch job code is unable to recover from a
FAILURE, i.e. if the last batch job run failed, then the next one also
results in a failure, even though debugging the code shows it returning
COMPLETE status at the end of execute.
With the database dump from the Hudson job of the Mifos 1.7.x deploy
and a breakpoint at
org.mifos.framework.components.batchjobs.MifosBatchJob#line152
you can see that a batch job execution which has
"BatchStatus.COMPLETE && ExitStatus.NOOP" is forced to become a failure.
If a batch job is forced to fail even though it completed and no
operation was required, it will probably get stuck in this condition,
because the next attempt to recover will also be forced to fail.
There is a catch(JobInstanceAlreadyCompleteException) in
org.mifos.framework.components.batchjobs.MifosBatchJob.launchJob(Job,
JobParameters, BatchStatus)
but in the case of "BatchStatus.COMPLETE && ExitStatus.NOOP" no
exception is thrown.
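
To make the stuck state concrete, here is a minimal model of it. This is
only a sketch under assumptions: the enums are local stand-ins rather than
the Spring Batch classes, the names strictMapping, lenientMapping and
statusAfterRetries are hypothetical, and none of this is the actual
MifosBatchJob code or the content of the pasted patch.

```java
class NoopStatusSketch {
    // Local stand-ins for the statuses named in the thread (not the Spring Batch types).
    enum BatchStatus { COMPLETE, FAILED }
    enum ExitStatus { COMPLETE, NOOP }

    /** Behaviour reported in the thread: COMPLETE + NOOP is turned into a failure. */
    static BatchStatus strictMapping(BatchStatus batch, ExitStatus exit) {
        if (batch == BatchStatus.COMPLETE && exit == ExitStatus.NOOP) {
            return BatchStatus.FAILED;
        }
        return batch;
    }

    /** Assumed alternative: a completed run with nothing left to do stays completed. */
    static BatchStatus lenientMapping(BatchStatus batch, ExitStatus exit) {
        return batch;
    }

    /**
     * Starts from a failed run, replays the given number of recovery attempts
     * that each end as COMPLETE + NOOP, and returns the status recorded last.
     */
    static BatchStatus statusAfterRetries(boolean strict, int attempts) {
        BatchStatus recorded = BatchStatus.FAILED;
        for (int i = 0; i < attempts; i++) {
            recorded = strict
                    ? strictMapping(BatchStatus.COMPLETE, ExitStatus.NOOP)
                    : lenientMapping(BatchStatus.COMPLETE, ExitStatus.NOOP);
        }
        return recorded;
    }

    public static void main(String[] args) {
        System.out.println("strict:  " + statusAfterRetries(true, 3));
        System.out.println("lenient: " + statusAfterRetries(false, 3));
    }
}
```

With the strict mapping the recorded status never leaves FAILED, however
many recovery attempts are made. With the lenient one the first attempt
already records COMPLETE.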
http://paste.ubuntu.com/524281/
Udai
PS: these are some notes
- Upgrade1283765911.java is responsible for migrating data from
scheduled_tasks to the new BATCH_* tables.
- The migrated data can be identified with the query "SELECT * FROM
BATCH_JOB_EXECUTION B where last_updated is null".
- Upgrade1283765911.java migrates only the latest execution of each job.
* QUERY USED - select taskname, max(starttime) from scheduled_tasks
where status = 1 and description = 'Finished Successfully' group by
taskname
* WHY NOT USE - select taskname, starttime, endtime from
scheduled_tasks where status = 1 and description = 'Finished
Successfully' group by starttime desc
* We should exclude the batch jobs which have been removed so far in 1.7.x.
* Is the purpose of this upgrade script to preserve historical
information or just to provide a start point?
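
As a rough illustration of what the QUERY USED selects (the latest
successful start time per task), with the suggested exclusion of removed
jobs added, here is an in-memory sketch. The TaskRun class, the
removedJobs set, the job name RemovedTaskJob and the sample rows are
hypothetical. Only the column names and the filter values come from the
queries above, and this is not the code of Upgrade1283765911.java.

```java
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class LatestRunSketch {
    /** Minimal stand-in for one scheduled_tasks row. */
    static final class TaskRun {
        final String taskName;
        final LocalDateTime startTime;
        final int status;
        final String description;

        TaskRun(String taskName, LocalDateTime startTime, int status, String description) {
            this.taskName = taskName;
            this.startTime = startTime;
            this.status = status;
            this.description = description;
        }
    }

    /** Latest successful start time per task name, skipping jobs that no longer exist. */
    static Map<String, LocalDateTime> latestSuccessfulRuns(List<TaskRun> rows, Set<String> removedJobs) {
        Map<String, LocalDateTime> latest = new TreeMap<>();
        for (TaskRun row : rows) {
            boolean succeeded = row.status == 1 && "Finished Successfully".equals(row.description);
            if (!succeeded || removedJobs.contains(row.taskName)) {
                continue;
            }
            // Keep the later of the stored and the current start time (the max per group).
            latest.merge(row.taskName, row.startTime, (a, b) -> a.isAfter(b) ? a : b);
        }
        return latest;
    }

    /** Hypothetical sample data: two successful runs, one failed run, one removed job. */
    static List<TaskRun> sampleRows() {
        return Arrays.asList(
                new TaskRun("ProductStatusJob", LocalDateTime.of(2010, 10, 30, 0, 0), 1, "Finished Successfully"),
                new TaskRun("ProductStatusJob", LocalDateTime.of(2010, 10, 31, 0, 0), 1, "Finished Successfully"),
                new TaskRun("ProductStatusJob", LocalDateTime.of(2010, 11, 1, 0, 0), 0, "Failed"),
                new TaskRun("RemovedTaskJob", LocalDateTime.of(2010, 10, 31, 0, 0), 1, "Finished Successfully"));
    }

    public static void main(String[] args) {
        Set<String> removed = Collections.singleton("RemovedTaskJob");
        System.out.println(latestSuccessfulRuns(sampleRows(), removed));
    }
}
```

Only one row per remaining task survives, which matches the observation
that the upgrade provides a single start point per job rather than the
full history.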
Thanks for the investigation. You are right that if a previous job
instance with the status "BatchStatus.COMPLETE && ExitStatus.NOOP" forces
the next job instance to fail, then it is a serious bug. However, the
interesting question here is whether all the steps of such a job instance
really finished successfully.
As far as I know the purpose of upgrade script was just to provide a
start point for the new scheduler implementation.
Regards,
Jakub.
On 02.11.2010 07:26, Udai Gupta wrote:
> Hi,
>
> [snip]

> thanks for the investigation. You are right that if the previous job
> instance status of "BatchStatus.COMPLETE && ExitStatus.NOOP" enforces
> failure of the next job instance, then it is a serious bug. However, the
> interesting question here is if all the steps of such job instance were
> really successfully finished.
I am proposing this patch
http://paste.ubuntu.com/524281/
to make sure that the next batch jobs won't be forced to FAILURE.
Could you help me figure out whether this is correct?
Now, about the question "if all the steps of such job instance were
really successfully finished":
I am not sure. There are various factors, like
* Date of batch job
- There are some batch jobs which are able to recover
regardless of date, but some dev
* DataSource/Hibernate error
- There are some problems about time-out of connection
I have to check the recoverability (whether it is actually able to
recover after a failure) of each job. I think there is only one step
associated with every batch job in Mifos, which is
MifosScheduler#execute(Timestamp).
> As far as I know the purpose of upgrade script was just to provide a
> start point for the new scheduler implementation.
Okay, I got confused because of this
http://mifosforge.jira.com/wiki/display/MIFOS/Quartz+Batch+Jobs (5th section)
Thanks,
Udai
thanks for the patch! :)
I will take a look at it and send you my feedback tomorrow probably.
Regards,
Jakub.
On 02.11.2010 09:19, Udai Gupta wrote:
> Hi Jakub,
latest patch
http://paste.ubuntu.com/524599
This patch was applied to master and 1.7.x after discussing it with Jakub.
http://mifosforge.jira.com/browse/MIFOS-4081
Here are the conclusions after looking into the existing batch jobs and
the dates they use. This came out of the discussion on how to recover if
a batch job fails.
http://mifosforge.jira.com/browse/MIFOS-4082
ApplyHolidayChangesTaskJob - Doesn't matter which date is passed
SavingsIntPostingTaskJob - Uses the fireTime passed
LoanArrearsAgingTaskJob - Depends on LoanArrearsTask (which is part of
LoanArrearsAndPortfolioAtRiskTaskJob); uses the current system date
ApplyCustomerFeeChangesTaskJob - Doesn't matter which date is passed
BranchReportTaskJob - Uses the fireDate (timeInMilis)
LoanArrearsAndPortfolioAtRiskTaskJob - Uses the current system date
ProductStatusJob - Uses the fireDate (timeInMilis)
GenerateMeetingsForCustomerAndSavingsTaskJob - Uses the current system date
fireDate is the date on which the batch job was supposed to run. It is
stored in the BATCH_* tables, so if a failure occurs, the next catch-up
will use that fireDate.
The catch-up works well for the batch jobs which use the fireDate.
To recover from a failure of the batch jobs which use the system date,
we need to move the system date.
These pages need to be updated.
http://mifos.org/functional-specifications/system-processing/batch-jobs
http://mifos.org/documentation/system-administration/managing-batch-jobs
The interesting question which Jakub raised was "Should we remove the
catchup mechanism completely?"
I would say we keep the catch-up mechanism and make sure that the
batch jobs which depend on the date use fireDate and not the
system date. The jobs to change are:
LoanArrearsAgingTaskJob (also resolve the dependency problem with
other batch jobs)
LoanArrearsAndPortfolioAtRiskTaskJob
GenerateMeetingsForCustomerAndSavingsTaskJob
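
The kind of change this implies can be sketched as follows. The DatedTask
interface and both factory methods are hypothetical names, and the use of
java.time and UTC is an assumption made only to keep the example short.
This is not the signature of the real Mifos job classes.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

class FireDateSketch {
    /** Hypothetical job contract: the scheduler passes the time the run was due. */
    interface DatedTask {
        LocalDate effectiveDate(Instant fireTime);
    }

    /** Reads the clock, so a late run works on today's date, not the missed one. */
    static DatedTask systemDateTask(Clock clock) {
        return fireTime -> LocalDate.now(clock);
    }

    /** Derives the date from fireTime, so a late run works on the date it was due. */
    static DatedTask fireDateTask() {
        return fireTime -> fireTime.atZone(ZoneOffset.UTC).toLocalDate();
    }

    public static void main(String[] args) {
        Instant due = Instant.parse("2010-11-01T00:00:00Z");
        // Pretend the run actually happens five days late.
        Clock late = Clock.fixed(Instant.parse("2010-11-06T00:00:00Z"), ZoneOffset.UTC);
        System.out.println("system date job works on: " + systemDateTask(late).effectiveDate(due));
        System.out.println("fireDate job works on:    " + fireDateTask().effectiveDate(due));
    }
}
```

Passing the date in, instead of reading the clock inside the job, is what
lets a late run behave like the run that was missed.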
Thanks,
Udai
>
>The catch up works well with the batch jobs which uses the fireDate,
>We need to move system date to recover failure of the batch jobs which
>uses system date.
So does this mean that a failed batch job that uses the system date, e.g.
GenerateMeetingsForCustomerAndSavingsTaskJob, won't really catch up if it
is rerun via the Batch Jobs admin page?
>
>These pages need to be updated.
>http://mifos.org/functional-specifications/system-processing/batch-jobs
>http://mifos.org/documentation/system-administration/managing-batch-jobs
I suggest we move these instructions to the developer wiki and merge
them with the documentation Jakub previously provided -
http://mifosforge.jira.com/wiki/display/MIFOS/Quartz+Batch+Jobs. I
created this task for someone to pick up -
http://mifosforge.jira.com/browse/MIFOS-4100.
>
>I would say we keep the catch up mechanism and make sure that the
>batch jobs which depends on the date should use fireDate and not the
>system date.
Can you capture what's left to do in new jira issues? Until they use
fireDate, does that mean they can't be caught up via the UI as I asked
above?
Regards,
Jeff
> So does this mean that a failed batch job that uses system date e.g.
> GenerateMeetingsForCustomerAndSavingsTaskJob, if rerun via the Batch
> Jobs admin page won't really catch up?
Suppose Mifos is down for 5 days (or the batch job failed for 5 days).
When the batch job runs on the 6th day, the catch-up mechanism will run
the batch job 5 more times, passing "fireDate equal to the date the
batch job was supposed to run" as an argument (also called the
jobParameter Map) to the execute method. If the batch job uses the
system date, then the 5 extra runs on the 6th day will not have the same
effect as 5 regular daily runs. In this case you would have to move the
system date 5 times and run the batch job 5 times from the admin page to
actually catch up.
So, if a batch job uses the fireDate instead of the system date, the
catch-up for the last 5 days will be done by the 6th-day run (or by just
one click from the job admin page).
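
The 5-day example can be written out as a small simulation. The names
missedFireDates and datesProcessed are hypothetical, and the sketch
assumes one scheduled run per day. It only models which dates get
processed, not the real scheduler.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CatchUpSketch {
    /** Fire dates of the missed runs: every day after lastRun and before today. */
    static List<LocalDate> missedFireDates(LocalDate lastRun, LocalDate today) {
        List<LocalDate> missed = new ArrayList<>();
        for (LocalDate d = lastRun.plusDays(1); d.isBefore(today); d = d.plusDays(1)) {
            missed.add(d);
        }
        return missed;
    }

    /** Distinct dates worked on during catch-up, depending on which date the job reads. */
    static Set<LocalDate> datesProcessed(List<LocalDate> fireDates, LocalDate today, boolean usesFireDate) {
        Set<LocalDate> processed = new LinkedHashSet<>();
        for (LocalDate fireDate : fireDates) {
            // A job that reads the system date sees "today" on every catch-up run.
            processed.add(usesFireDate ? fireDate : today);
        }
        return processed;
    }

    public static void main(String[] args) {
        LocalDate lastRun = LocalDate.of(2010, 10, 31);
        LocalDate today = LocalDate.of(2010, 11, 6);
        List<LocalDate> missed = missedFireDates(lastRun, today);
        System.out.println("missed runs:          " + missed);
        System.out.println("fireDate job covers:  " + datesProcessed(missed, today, true));
        System.out.println("system date job sees: " + datesProcessed(missed, today, false));
    }
}
```

The fireDate job covers all 5 missed days in one catch-up pass, while the
system date job processes the 6th day 5 times over and leaves the missed
days untouched.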
> I suggest we move these instructions to the developer wiki incorporating
> with the documentation Jakub previously provided -
> http://mifosforge.jira.com/wiki/display/MIFOS/Quartz+Batch+Jobs. I
> created this task for someone to pick up -
> http://mifosforge.jira.com/browse/MIFOS-4100.
Cool, much better than updating those pages. :)
Udai