Resuming Long running jobs during restart

Kandada Boggu

unread,

Apr 8, 2010, 8:11:26 PM4/8/10

to ruote

I am evaluating Route for my project. In the past, I have used
enterprise BPM products like Staffware and Inconcert. I was able to
find most of the standard workflow constructs in Route. I need help to
understand following functionality:

1) Resuming jobs during restart
Lets say a worker gets terminated while processing a participant.
Is the job assigned to another worker(if available)
Is the job resumed upon worker restart?
Is the participant re-processed?

2) Work list management API
What is the API for getting the list of tasks assigned to a user/role
across jobs/processes?
What is the API to acquire/revert/complete a human task?
Does the worklist API scale to handle millions of tasks?(our workflows
can run for years together)

3) What are the guidelines for associating human tasks with users/
roles/groups?

4) Can I add participants dynamically?
Is it possible to add participants to the process definition of the
current job within a participant.

5) Is the ActiveRecord storage participant stable?

John Mettraux

unread,

Apr 8, 2010, 8:39:33 PM4/8/10

to openwfe...@googlegroups.com

Hello,

On Fri, Apr 9, 2010 at 9:11 AM, Kandada Boggu <kandad...@gmail.com> wrote:
> I am evaluating Route for my project. In the past, I have used
> enterprise BPM products like Staffware and Inconcert. I was able to
> find most of the standard workflow constructs in Route.

Indeed,

http://ruote.rubyforge.org/patterns.html

> I need help to
> understand following functionality:

OK. I will try to reply with ruote's vocabulary.

> 1) Resuming jobs during restart
> Lets say a worker gets terminated while processing a participant.
> Is the job assigned to another worker(if available)

No, if the operation/job/message has been taken/reserved by a worker,
it is considered done (by all the workers).

> Is the job resumed upon worker restart?

No, for the same reason.

> Is the participant re-processed?

No, for the same reason.

The solution to this problem (an operation getting stalled), is to
re_apply the stalled operation.

There were discussion for automating re-applies a year ago, but nobody
has been requesting that feature since.

Here is a rough example (test cases) :

http://github.com/jmettraux/ruote/blob/ruote2.1/test/functional/ft_14_re_apply.rb

Rdoc :

http://github.com/jmettraux/ruote/blob/ruote2.1/lib/ruote/engine.rb#L121-141

> 2) Work list management API
> What is the API for getting the list of tasks assigned to a user/role
> across jobs/processes?

Here is an example : http://gist.github.com/360715

---8<---
require 'rubygems'
require 'ruote' # gem install ruote

engine = Ruote::Engine.new(Ruote::Worker.new(Ruote::HashStorage.new()))

sp = engine.register_participant '.+', Ruote::StorageParticipant

pdef = Ruote.process_definition do
concurrence do
alpha
bravo
end
end

wfid0 = engine.launch(pdef)
wfid1 = engine.launch(pdef)

sleep 1

puts "workitems for 'alpha' :"
sp.by_participant('alpha').each do |wi|
puts " - #{wi.fei.wfid} #{wi.fei.expid} #{wi.participant_name}"
end

puts "workitems for 'bravo' :"
sp.by_participant('bravo').each do |wi|
puts " - #{wi.fei.wfid} #{wi.fei.expid} #{wi.participant_name}"
end
--->8---

> What is the API to acquire/revert/complete a human task?

There is no "out of the box" 'acquire' implementation. I tend to
implement via a workitem field named 'user', when empty (nil) the
workitem is not acquired. When acquired, it contains the name of the
user.

Many possibilities.

From the point of view of the human, reverting/completing a task is
the same thing : returning the workitem. When reverting, you'd flag a
workitem field like 'reverted' => true.

When the engine itself reverts a task/workitem, it's called "cancel".

---8<---
wi = sp.by_participant('alfred').first
wi.fields['supervised_by'] = 'alfred'
sp.reply(wi)
--->8---

grabs a workitem, adds a field to it and hands to back to the engine
(via the storage participant).

> Does the worklist API scale to handle millions of tasks?(our workflows
> can run for years together)

It depends on the storage you use. I personally haven't tried that far.

> 3) What are the guidelines for associating human tasks with users/
> roles/groups?

Common sense and experience.

I tend to go for "one catch all participant for human tasks" and
participants with specific names (or regexes) for automated tasks. I
tend to use role names for human participants and action names for
automated participants.

It depends on the field / application scope.

> 4) Can I add participants dynamically?

Yes.

> Is it possible to add participants to the process definition of the
> current job within a participant.

I don't understand the question.

> 5) Is the ActiveRecord storage participant stable?

No, it is currently not actively maintained.

http://ruote.rubyforge.org/configuration.html#storage

I'd suggest using ruote-dm which will be very improved for the
upcoming 2.1.10 release.

Other storages are worth a look as well.

Best regards,

--
John Mettraux - http://jmettraux.wordpress.com

Kandada Boggu

unread,

Apr 8, 2010, 10:48:28 PM4/8/10

to ruote

Thanks for the detailed reply. I have some additional questions:

> > 1) Resuming jobs during restart
> > Lets say a worker gets terminated while processing a participant.
> > Is the job assigned to another worker(if available)
>
> No, if the operation/job/message has been taken/reserved by a worker,
> it is considered done (by all the workers).
>
> > Is the job resumed upon worker restart?
>
> No, for the same reason.
>
> > Is the participant re-processed?
>

> No, for the same reason.I

Let me try to pose the question with a real life example of insurance
claims processing.
System starts a job upon receiving a claims request. The job has a
human task that is assigned
to ClaimsAgent role. When a claims agent checks his work queue he will
see the new claims request.
If he chooses to acquire the claim, then that task is taken off the
queue. Actual processing time
for the claim runs in to several days. Once the agent completes the
task the job moves to the next step.
In this scenario:
1) Does the system support asynchronous completion of tasks? Or Does
it simply perform a synchronous wait for completion?
2) What happens in the above scenario if the worker goes down before
the task is completed? ( this was answered partly in your
previous reply, but I am hoping that additional information might
help)

> From the point of view of the human, reverting/completing a task is
> the same thing : returning the workitem. When reverting, you'd flag a
> workitem field like 'reverted' => true.
>

Who is supposed to put the task back to the queue? In my example,
claims agent(or an admin) can
put the claim back to the claims processing queue.

> > 4) Can I add participants dynamically?
>
> Yes.
>
> > Is it possible to add participants to the process definition of the
> > current job within a participant.
>
> I don't understand the question.

This is case where you are adding new tasks to a running job.
Example: HR department has a workflow to handle employee complaints.
Depending upon the nature of the complaint a sub work flow is
executed.
The sub workflows are defined in a library. Every season new sub
workflows are added to the library based on the new corporate
directive.
Actual sub workflow to be executed for a complaint is determined by a
human operative. The human might make different decisions for the same
scenario
based on the context. Is it possible to insert a new task based to a
running job?
In other workflow systems I have used before, I could modify the
current job by inserting new tasks to the job. Other interesting
aspect this requirement
is a concept called design by discovery. Organizations used to run the
workflow in the dynamic mode and after several months formalize the
most optimal path as the official process.

Thanks
Kandada Boggu

John Mettraux

unread,

Apr 8, 2010, 11:10:50 PM4/8/10

to openwfe...@googlegroups.com

On Fri, Apr 9, 2010 at 11:48 AM, Kandada Boggu <kandad...@gmail.com> wrote:
>
> Let me try to pose the question with a real life example of insurance
> claims processing.
> System starts a job upon receiving a claims request. The job has a
> human task that is assigned
> to ClaimsAgent role. When a claims agent checks his work queue he will
> see the new claims request.
> If he chooses to acquire the claim, then that task is taken off the
> queue. Actual processing time
> for the claim runs in to several days. Once the agent completes the
> task the job moves to the next step.
>
> In this scenario:
> 1) Does the system support asynchronous completion of tasks? Or Does
> it simply perform a synchronous wait for completion?

Everything is asynchronous.

The system waits indefinitely unless a timeout [4] is set for the
participant or the segment of process.

In case of timeout, the engine engines cancles the segment of progress
and trigger :on_cancel if present [1] .

When the cancelled segment of process contains an applied participant,
a cancel notification is sent to the participant implementation. In
the case of a storage participant the workitem is removed.

By default the flow resumes after the timedout segment.

> 2) What happens in the above scenario if the worker goes down before
> the task is completed? ( this was answered partly in your
> previous reply, but I am hoping that additional information might
> help)

The operation will be placed in the execution queue, waiting to be
fetched and processed by a worker. If there is no worker it will just
stay there.

Outdated timeouts are triggered when a worker is available.

>> From the point of view of the human, reverting/completing a task is
>> the same thing : returning the workitem. When reverting, you'd flag a
>> workitem field like 'reverted' => true.
>
> Who is supposed to put the task back to the queue? In my example,
> claims agent(or an admin) can
> put the claim back to the claims processing queue.

Exactly.

>> > Is it possible to add participants to the process definition of the
>> > current job within a participant.
>>
>> I don't understand the question.
>
> This is case where you are adding new tasks to a running job.
> Example: HR department has a workflow to handle employee complaints.
> Depending upon the nature of the complaint a sub work flow is
> executed.
> The sub workflows are defined in a library. Every season new sub
> workflows are added to the library based on the new corporate
> directive.
> Actual sub workflow to be executed for a complaint is determined by a
> human operative. The human might make different decisions for the same
> scenario
> based on the context. Is it possible to insert a new task based to a
> running job?
> In other workflow systems I have used before, I could modify the
> current job by inserting new tasks to the job. Other interesting
> aspect this requirement
> is a concept called design by discovery. Organizations used to run the
> workflow in the dynamic mode and after several months formalize the
> most optimal path as the official process.

Yes, you can.

I've seen people using loop and cursor [2] or concurrent-iterator [3]
to add new branches.

Please read the documentation. Ruote is very dynamic and since it's
based on Ruby, it's even more dynamic.

You'll find help for modeling and running processes here. We're open
to suggestions, critiques, etc, but please read the documentation
first.

[1] http://ruote.rubyforge.org/common_attributes.html#on_cancel
[2] http://ruote.rubyforge.org/exp/cursor.html
[3] http://ruote.rubyforge.org/exp/concurrent_iterator.html
[4] http://ruote.rubyforge.org/common_attributes.html#timeout

John Mettraux

unread,

Apr 8, 2010, 11:37:10 PM4/8/10

to openwfe...@googlegroups.com

On Fri, Apr 9, 2010 at 12:10 PM, John Mettraux <jmet...@openwfe.org> wrote:
> On Fri, Apr 9, 2010 at 11:48 AM, Kandada Boggu <kandad...@gmail.com> wrote:
>
>> 2) What happens in the above scenario if the worker goes down before
>> the task is completed? ( this was answered partly in your
>> previous reply, but I am hoping that additional information might
>> help)
>
> The operation will be placed in the execution queue, waiting to be
> fetched and processed by a worker. If there is no worker it will just
> stay there.
>
> Outdated timeouts are triggered when a worker is available.

Sorry, I was wrong in my reply to your first email.

"re_apply" is for cases where there is an external participant
[implementation] that loses its workitem. With re_apply, you get the
engine to resend the workitem to the participant.

Kandada Boggu

unread,

Apr 9, 2010, 12:42:48 AM4/9/10

to ruote

Thanks lot. I will spend next few days reading the documents and the
code.

Cheers
Kandada Boggu

On Apr 8, 8:37 pm, John Mettraux <jmettr...@openwfe.org> wrote:
> On Fri, Apr 9, 2010 at 12:10 PM, John Mettraux <jmettr...@openwfe.org> wrote:

Olle

unread,

Apr 9, 2010, 7:18:48 AM4/9/10

to ruote

John,

I also think it would be great if there will be some option to re-
run killed long-running jobs. It makes the engine more reliable. I'm
playing with ruote at the moment and I already faced this problem.

Please consider an example:
---
# encoding: utf-8
require 'rubygems'
require 'ruote'
require 'ruote/storage/fs_storage'

engine = Ruote::Engine.new(
Ruote::Worker.new(
Ruote::FsStorage.new('ruote_stalled')))

s = engine.register_participant 'sleepy' do |workitem|
puts 'before'
#here could be a call to web-service to put workitem in external
task system. So any kind of waitings are possible. If worker is killed
before it successfully sent information, there would be no item in
external task system and it never had a chance for run again.
sleep 30
puts "after"
end

# registering participants
if engine.processes.count == 0 then
puts 'No processes, launching...'
s.do_not_thread = false
# defining a process
pdef = Ruote.process_definition :name => 'test' do
sequence do
sleepy
end
end
wfid = engine.launch(pdef)
else
wfid = engine.processes[0].wfid
puts "#{engine.processes.count} processes, waiting (#{wfid}) ..."
end

engine.wait_for(wfid)
--------

Run it first time and kill the process. Then run again. Stalled
participant never runs again.
If it's possible, please consider my request.

Best regards, Oleg

John Mettraux

unread,

Apr 9, 2010, 7:27:45 AM4/9/10

to openwfe...@googlegroups.com

On Fri, Apr 9, 2010 at 8:18 PM, Olle <foe...@gmail.com> wrote:
>
> Run it first time and kill the process. Then run again. Stalled
> participant never runs again.
> If it's possible, please consider my request.

Hello Oleg,

as explained in this email :

http://groups.google.com/group/openwferu-users/msg/3c07b591a6912f42

There is a engine.re_apply(fei) which "re-applies" stalled expressions.

I'm open to suggestions to make it better.

John Mettraux

unread,

Apr 9, 2010, 7:40:31 AM4/9/10

to openwfe...@googlegroups.com

On Fri, Apr 9, 2010 at 8:27 PM, John Mettraux <jmet...@openwfe.org> wrote:
> On Fri, Apr 9, 2010 at 8:18 PM, Olle <foe...@gmail.com> wrote:
>>
>> Run it first time and kill the process. Then run again. Stalled
>> participant never runs again.
>> If it's possible, please consider my request.
>
> Hello Oleg,
>
> as explained in this email :
>
> http://groups.google.com/group/openwferu-users/msg/3c07b591a6912f42
>
> There is a engine.re_apply(fei) which "re-applies" stalled expressions.
>
> I'm open to suggestions to make it better.

For brittle participants, one useful technique would be :on_timeout => 'redo' :

http://github.com/jmettraux/ruote/blob/ruote2.1/test/functional/ft_15_timeout.rb#L80-82
http://ruote.rubyforge.org/common_attributes.html#on_timeout

Serves as an auto-re_apply. Use with care.

Olle

unread,

Apr 9, 2010, 7:53:40 AM4/9/10

to ruote

Let me ask another question. How (and is it possible) to identify
"partially executed participant" using information from engine? We
have some custom participant. Engine calls "consume", but during this
call something goes wrong and engine get failed. How to separate such
kind of "partially executed participant" from other ones?

> No, if the operation/job/message has been taken/reserved by a worker,
> it is considered done (by all the workers).

May be engine "marks" expressions after returning from "consume" for
futher filtering?

I guess, other way is to create participant that set some workitem
field before returning from consume and filter based on this field.

p.s. John, do you remember our discussion about 'ruote on windows'? We
switched to Linux for now and put windows-related work to intermediate
web-service. ))).

Best regards,
Oleg

John Mettraux

unread,

Apr 9, 2010, 8:17:31 AM4/9/10

to openwfe...@googlegroups.com

On Fri, Apr 9, 2010 at 8:53 PM, Olle <foe...@gmail.com> wrote:
> Let me ask another question. How (and is it possible) to identify
> "partially executed participant" using information from engine? We
> have some custom participant. Engine calls "consume", but during this
> call something goes wrong and engine get failed. How to separate such
> kind of "partially executed participant" from other ones?
>
>> No, if the operation/job/message has been taken/reserved by a worker,
>> it is considered done (by all the workers).
>
> May be engine "marks" expressions after returning from "consume" for
> futher filtering?
>
> I guess, other way is to create participant that set some workitem
> field before returning from consume and filter based on this field.

Hello Oleg,

Understood.

I found this old thread of discussion :

http://groups.google.com/group/openwferu-users/browse_thread/thread/c2aa4b53d1664d45/8523a1a5ee98fd71

I want to implement the technique discussed in it to cover this
aspect. Thanks for lobbying me again about it.

Stay tuned.

> p.s. John, do you remember our discussion about 'ruote on windows'? We
> switched to Linux for now and put windows-related work to intermediate
> web-service. ))).

I can now sleep soundly ;-)

Cheers,

John Mettraux

unread,

Apr 13, 2010, 11:58:30 PM4/13/10

to openwfe...@googlegroups.com

On Fri, Apr 9, 2010 at 9:17 PM, John Mettraux <jmet...@openwfe.org> wrote:
> On Fri, Apr 9, 2010 at 8:53 PM, Olle <foe...@gmail.com> wrote:
>>
>> Let me ask another question. How (and is it possible) to identify
>> "partially executed participant" using information from engine? We
>> have some custom participant. Engine calls "consume", but during this
>> call something goes wrong and engine get failed. How to separate such
>> kind of "partially executed participant" from other ones?
>>
>>> No, if the operation/job/message has been taken/reserved by a worker,
>>> it is considered done (by all the workers).
>>
>> May be engine "marks" expressions after returning from "consume" for
>> futher filtering?
>>
>> I guess, other way is to create participant that set some workitem
>> field before returning from consume and filter based on this field.
>

> Understood.
>
> I found this old thread of discussion :
>
> http://groups.google.com/group/openwferu-users/browse_thread/thread/c2aa4b53d1664d45/8523a1a5ee98fd71
>
> I want to implement the technique discussed in it to cover this
> aspect. Thanks for lobbying me again about it.
>
> Stay tuned.

Hello Oleg,

I have implemented the first part of the idea :

http://github.com/jmettraux/ruote/commit/0632032d36a61a1358d1d41e4ca3b2a7fd72aa9c

It now flags a participant expressions with "dispatched = true" right
after the dispatch is done. Thus the flag will not be set if the
system crashes during the dispatch.

Note that for a participant that threads, the dispatched flagging is
done right after the thread is launched. Hence in,

http://github.com/jmettraux/ruote/blob/0632032d36a61a1358d1d41e4ca3b2a7fd72aa9c/test/functional/eft_3_participant.rb#L122-148

to ease my tester life, I set the do_not_thread to false so that I can
test the dispatched flag immediately.

Next step is about automating the retry (or not).

re_apply for all the participants that are not dispatched.

I'm wondering if it should be a default operation for a [re]starting
engine or if it should be an engine method like
engine.re_apply_stalled_participants

Not sure. What do you guys think ?

Olle

unread,

Apr 14, 2010, 3:07:29 PM4/14/10

to ruote

Hello, John!

>
> I have implemented the first part of the idea :
>

Very good news, thank you!

>
> Note that for a participant that threads, the dispatched flagging is
> done right after the thread is launched.

I don't know much about ruby threads implementation... Does it mean
that participant expression can be flagged as dispatched before it
really executed 'consume' in case of threaded dispatch?

>
> I'm wondering if it should be a default operation for a [re]starting
> engine or if it should be an engine method like
> engine.re_apply_stalled_participants
>
> Not sure. What do you guys think ?
>

Same for me, not sure ).
Imho, it's better to re-apply by default in most of cases. But I can
also imagine a participant implementation which may not be re-applied
without hand work before...

---
Best regards, Oleg

John Mettraux

unread,

Apr 14, 2010, 8:25:03 PM4/14/10

to openwfe...@googlegroups.com

On Thu, Apr 15, 2010 at 4:07 AM, Olle <foe...@gmail.com> wrote:
>
>> Note that for a participant that threads, the dispatched flagging is
>> done right after the thread is launched.
>
> I don't know much about ruby threads implementation... Does it mean
> that participant expression can be flagged as dispatched before it
> really executed 'consume' in case of threaded dispatch?

Hello Oleg,

you are right.

Fixed : http://github.com/jmettraux/ruote/commit/c0368492d5ebd983bc28cbb10303cd98a89fd77b

>> I'm wondering if it should be a default operation for a [re]starting
>> engine or if it should be an engine method like
>> engine.re_apply_stalled_participants
>>
>> Not sure. What do you guys think ?
>>
>
> Same for me, not sure ).
> Imho, it's better to re-apply by default in most of cases. But I can
> also imagine a participant implementation which may not be re-applied
> without hand work before...

Maybe I can provide this re_apply_stalled_participants and make sure
it only re_applies participant expressions that are at lease two
minute old (to avoid re_applying just launched processes).

Many thanks !

Olle

unread,

Apr 15, 2010, 6:00:28 AM4/15/10

to ruote

John,

>
> Maybe I can provide this re_apply_stalled_participants and make sure
> it only re_applies participant expressions that are at lease two
> minute old (to avoid re_applying just launched processes).
>

It supposed to be the most flexible decision. Users will have a way to
do re_applay_stalled in any desired manner.

Best regards,
Oleg!

Reply all

Reply to author

Forward