Problem of resumption after crash


Eric Platon

Sep 24, 2010, 4:54:33 AM
to ruote
Hi,

I am evaluating how Ruote deals with crashes at the workflow level.
Resumption works, with an unexpected side-effect: The engine does
resume a crashed process, but it also runs the workflow once more!

I am probably doing something wrong. Here is a contrived example to
show the problem.

4 steps
----------
Step 1) Run the examples/ruote_quickstart.rb
#=> I received a message from Alice

Step 2) Modify the process definition (so we have time to crash the
process violently). The definition becomes:
---
pdef = Ruote.process_definition :name => 'test' do
  sequence do
    participant :alpha
    wait :for => '5s' # <= added statement
    participant :bravo
  end
end
---

Step 3) Run the new example. When it reaches the wait (almost
immediately), kill the script (e.g. CTRL+C).
#=> No output but the interrupt report from the shell.

Step 4) Run the new example again.
#=> I received a message from Alice\nI received a message from Alice
----------

The output message is produced twice after resumption: Output 1 is the
resumption, output 2 is a second run. Possible explanation at this
point: Ruote::ReceiverMixin#launch puts a "launch" message each time
it starts. On resumption, the worker picks any remaining message from
previous runs, and picks "launch" again for another run (the order
depends on the scheduled waiting time). After studying the API, I
could not find any alternative to using Ruote::ReceiverMixin#launch to
avoid the extra "launch" message.
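To make the hypothesis concrete, here is a toy simulation in plain Ruby (stand-in hashes only; none of this is ruote API) of what the msg queue could hold after a crash and an unconditional relaunch:

```ruby
# Toy model of the hypothesis: the storage keeps msgs across crashes,
# and every run of the script unconditionally adds a new 'launch' msg.
# Plain hashes stand in for ruote documents; this is not ruote code.

storage = []

# run 1: launch, then crash while a step is still pending
storage << { 'action' => 'launch', 'wfid' => 'wfid-1' }
storage << { 'action' => 'apply',  'wfid' => 'wfid-1' }  # left over by the crash

# run 2: the script boots and calls #launch again
storage << { 'action' => 'launch', 'wfid' => 'wfid-2' }

launches = storage.count { |msg| msg['action'] == 'launch' }
puts launches  # => 2, so the participant output shows up twice
```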

What is the proper way to resume a crashed process without an extra run?


Environment information:
* ruote 2.1.10
* ruby 1.8.7 (2010-01-10 patchlevel 249)

John Mettraux

Sep 24, 2010, 5:51:48 AM
to openwfe...@googlegroups.com

On Fri, Sep 24, 2010 at 01:54:33AM -0700, Eric Platon wrote:
> Hi,

Salut Eric,

it was nice meeting you at the Ruby meetup.

> I am evaluating how Ruote deals with crashes at the workflow level.
> Resumption works, with an unexpected side-effect: The engine does
> resume a crashed process, but it also runs the workflow once more!
>
> I am probably doing something wrong. Here is a contrived example to
> show the problem.

I have the impression it's only a simple misunderstanding. How about this version of the quickstart:

http://gist.github.com/595132

The launch part has been modified like this :

---8<---
WFID_FILE = 'ruote_quickstart_wfid.txt'

wfid = File.read(WFID_FILE).strip rescue nil

wfid ||= engine.launch(pdef)

File.open(WFID_FILE, 'wb') { |f| f.write(wfid) }

engine.wait_for(wfid)
# blocks current thread until our process instance terminates

FileUtils.rm(WFID_FILE)
--->8---
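For what it's worth, the guard logic of that snippet can be exercised without a running engine; in this sketch a stub stands in for engine.launch (fake_launch and LAUNCHES are illustrative names, not ruote's):

```ruby
require 'fileutils'
require 'securerandom'

# Stub standing in for engine.launch; the real call enqueues a launch msg.
LAUNCHES = []

def fake_launch
  wfid = SecureRandom.hex(4)
  LAUNCHES << wfid
  wfid
end

# Same guard as in the gist: reuse the stored wfid if the file survives.
def run_once(wfid_file)
  wfid = File.read(wfid_file).strip rescue nil
  wfid ||= fake_launch
  File.open(wfid_file, 'wb') { |f| f.write(wfid) }
  # a real script would engine.wait_for(wfid) here, then rm the file
  wfid
end

file = 'ruote_quickstart_wfid.txt'
FileUtils.rm_f(file)

first  = run_once(file)  # first run: launches
second = run_once(file)  # "crashed" before cleanup: reuses the stored wfid

puts first == second     # => true
puts LAUNCHES.size       # => 1, no second launch
FileUtils.rm_f(file)
```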

I use variations of this pattern when I want to ensure that only 1 instance of a specific process definition runs at some point (usually a process that involves a cron expression: http://ruote.rubyforge.org/exp/cron.html ).

Usually ruote is used inside a web application or as a standalone worker, and it's supposed to never exit. Processes are running, and so on. Processes are not automatically launched as the webapp or the worker starts; they are launched (a bit) later, triggered by an external event.

Sorry if the quickstart is misleading.

I'd suggest looking at ruote-kit

http://github.com/tosch/ruote-kit

for a web application (rack) that wraps ruote with a decent web interface. Maybe it's easier to get a feel for "ruote, the service" in that context.

So it's not a bug: the two launch requests were honoured.


Cheers !

--
John Mettraux - http://jmettraux.wordpress.com

Eric Platon

Sep 24, 2010, 12:15:30 PM
to ruote
Hi John,

It was really nice to talk to you at this Ruby event :-) Thanks for
your quick reply!

> I have the impression it's only a simple misunderstanding, how about this version of the quickstart :
> http://gist.github.com/595132

Thank you. This new version is very helpful and appreciated. I thought
about a pid-like file for resumption, but I was expecting a
storage-based approach (see below).

> Usually ruote is used inside of a web application or as a standalone worker and it's supposed to never exit. Processes are running and so on.

This is exactly why Ruote is a good candidate for my target :-) But we
are talking here about resumption after a crash. I guess it is a
desired property in many scenarios to get the engine to resume
"seamlessly" from some milestone in the process definition, similarly
to the statement in the documentation on implementing a storage
(http://ruote.rubyforge.org/implementing_a_storage.html).

> Processes are not automatically launched as the webapp or the worker starts, they are launched (a bit) later on, triggered by an external event.

I will have another question about that, but will keep it for a
dedicated thread.

> I'd suggest looking at ruote-kit
> http://github.com/tosch/ruote-kit
> for a web application (rack) that wraps ruote with a decent web interface. Maybe it's easier to get a feel for "ruote, the service" in that context.

Thank you for confirming. It does help in understanding the Ruote
component.

> So, it's not a bug, the two launch requests were honoured.

It is clear that calling #launch initializes a new process and thus
produces the expected behavior (meaning no bug), but I was actually
wondering whether #launch should not be "resume-aware" through the
storage. Not sure what it is worth right now. Your extended quickstart
is sound and clear to me, but I feel it could be more compact: the
storage could be used as a memory across runs, instead of adding a
wfid file. I mean, it looks less elegant than just relying on the
storage, notably when using FsStorage.

That's maybe just me, and it would be a significant rewrite. Still
learning! Going forward. Thanks again for your prompt support!

Eric

John Mettraux

Sep 25, 2010, 1:39:31 PM
to openwfe...@googlegroups.com

On Fri, Sep 24, 2010 at 09:15:30AM -0700, Eric Platon wrote:
>
> It is clear that calling #launch initializes a new process and thus
> produces the expected behavior (meaning no bug), but I was actually
> wondering whether #launch should not be "resume-aware" through the
> storage. Not sure what is worth right now. Your extended quickstart is
> sound and clear to me, but I feel that it could be more compact: The
> storage could be used as a memory across runs, instead of adding a
> wfid file. I mean, it looks less elegant than just relying on the
> storage, notably when using FsStorage.
>
> That's maybe just me, and it would be a significant rewrite. Still
> learning! Going forward. Thanks again for your prompt support!

Hello Eric,

I've been (slowly) preparing something about that :

http://gist.github.com/597086

I hope to finish it by tomorrow.

It's true that the wfid file trick is not suited to environments where engines in different Ruby runtimes launch such "unique" processes, hence my launch_single() work.
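A minimal sketch of the idea behind launch_single, with a plain Hash standing in for the storage (all names here are illustrative, not the actual implementation):

```ruby
require 'securerandom'

# Toy sketch of launch_single: the storage remembers, per process
# definition name, the wfid of the one running instance. A plain Hash
# stands in for the storage; none of these names is ruote's actual API.
STORAGE = { 'singles' => {} }

def running?(wfid)
  STORAGE.fetch('instances', {}).key?(wfid)
end

def launch_single(name)
  wfid = STORAGE['singles'][name]
  return wfid if wfid && running?(wfid)     # already launched and alive

  wfid = SecureRandom.hex(4)                # stand-in for the wfid generator
  STORAGE['singles'][name] = wfid           # remember the single instance
  (STORAGE['instances'] ||= {})[wfid] = :running
  wfid
end

a = launch_single('test')
b = launch_single('test')  # second call finds the live instance
puts a == b                # => true, no duplicate launch
```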


Stay tuned,

John Mettraux

Sep 25, 2010, 11:56:43 PM
to openwfe...@googlegroups.com

On Sun, Sep 26, 2010 at 02:39:31AM +0900, John Mettraux wrote:
>
> On Fri, Sep 24, 2010 at 09:15:30AM -0700, Eric Platon wrote:
> >
> > It is clear that calling #launch initializes a new process and thus
> > produces the expected behavior (meaning no bug), but I was actually
> > wondering whether #launch should not be "resume-aware" through the
> > storage. Not sure what is worth right now. Your extended quickstart is
> > sound and clear to me, but I feel that it could be more compact: The
> > storage could be used as a memory across runs, instead of adding a
> > wfid file. I mean, it looks less elegant than just relying on the
> > storage, notably when using FsStorage.
> >
> > That's maybe just me, and it would be a significant rewrite. Still
> > learning! Going forward. Thanks again for your prompt support!
>
> I hope to finish it by tomorrow.
>
> It's true that the wfid file trick is not adapted to environments where engines in different Ruby runtimes launch such "unique" processes, hence my launch_single() work.

Hello Eric,

thanks for your idea, I've refined the launch_single method into :

http://github.com/jmettraux/ruote/commit/b1d1046b60b4a11ef3b0cb4cc83b368870c34854

The tests look like :

http://github.com/jmettraux/ruote/blob/ruote2.1/test/functional/ft_46_launch_single.rb

Not super happy with the 'single[s]' appellation, but I didn't want to use 'singleton'... Well I could have, I already have 'instances'...


Thanks again,

Eric Platon

Sep 27, 2010, 4:48:14 AM
to ruote
Hi John,

I have just reviewed the code and tested it against the contrived
example in this thread. launch_single does exactly what I had in mind.
Thank you very much for this addition, and thumbs-up for completing it
overnight!

The reserved wfid cases in your first post
(http://gist.github.com/597086) surprised me a bit (are there really
reserved ids in Ruote?), and it seems there is also a risk of infinite
looping with the second return statement. The code is gone anyway, and
the latest version worked fine in a couple of tests.

About your second test case, is there any concrete reason to sleep for
0.4s?

Eric



John Mettraux

Sep 27, 2010, 5:11:09 AM
to openwfe...@googlegroups.com

On Mon, Sep 27, 2010 at 01:48:14AM -0700, Eric Platon wrote:
>
> The reserved wfid cases in your first post (http://gist.github.com/
> 597086) surprised me a bit (are there really reserved ids in Ruote?),
> and it seems there is also a risk for infinite looping with the second
> return statement. The code is gone anyway, and the latest version
> worked fine in a couple of tests.

Hello Eric,

yes, this was convoluted and the solution was one good night of sleep.

> About your second test case, is there any concrete reason to sleep for
> 0.4s ?

It's just to give some breathing time to the engine. With fast storages this is too much, but with a storage like ruote-couch, or if you're encoding mkvs in the background, it might be necessary. Since I run the tests against all the storages, I go for a safe value.

Speaking of things occurring at different paces, I think there might still be an issue in the current implementation: what happens if engine A just launched wfid0, and engine B stumbles on

return wfid if wfid && process(wfid) != nil
# process is already running

?

wfid0 is encountered, but process(wfid0) returns nil (for now).

I will probably expand that to something like

if wfid
  sleep 0.400
  return wfid if process(wfid)
end

or

2.times { |i| sleep(i * 0.350); return wfid if process(wfid) } if wfid

but well... Still the risk is here.


Best regards,

Eric Platon

Sep 28, 2010, 1:33:00 AM
to ruote
Hi John,

> yes, this was convoluted and the solution was one good night of sleep.

Ah, sleeping is often the key to many problems!

> It's just to give some breathing time to the engine, with fast storages this is too much, but with a storage like ruote-couch or if you're encoding mkvs in the background, it might be necessary. Since I run the tests with all the storages, I go for a safe value.

Thank you for the explanation.

> Speaking of things occuring at different paces, I think there might still be an issue in the current implementation : what happens if engine A just launched wfid0, and engine B stumbles on

My project requires a single engine. I feel safe against such a
scenario, but the problem exists, yes.

Instead of iterating in the hope of getting a unique name, how about
having launch_single also store the engine name alongside the wfid?

Eric

John Mettraux

Sep 28, 2010, 10:32:14 PM
to openwfe...@googlegroups.com

On Mon, Sep 27, 2010 at 10:33:00PM -0700, Eric Platon wrote:
>
> > Speaking of things occuring at different paces, I think there might still be an issue in the current implementation : what happens if engine A just launched wfid0, and engine B stumbles on
>
> My project requires a single engine. I feel safe against such a
> scenario, but the problem exists, yes.
>
> Instead of iterating in the hope to get a unique name, how about
> having launch_single also store the engine name aside the wfid ?

Hello Eric,

this would tell me that another engine did the launch, but maybe the process is dead (and needs to be relaunched).

Still thinking about this issue.


Thanks for the help,

John Mettraux

Sep 30, 2010, 12:33:17 AM
to openwfe...@googlegroups.com

On Wed, Sep 29, 2010 at 11:32:14AM +0900, John Mettraux wrote:
>
> On Mon, Sep 27, 2010 at 10:33:00PM -0700, Eric Platon wrote:
> >
> > > Speaking of things occuring at different paces, I think there might still be an issue in the current implementation : what happens if engine A just launched wfid0, and engine B stumbles on
> >
> > My project requires a single engine. I feel safe against such a
> > scenario, but the problem exists, yes.
> >
> > Instead of iterating in the hope to get a unique name, how about
> > having launch_single also store the engine name aside the wfid ?
>
> this would tell me that another engine did the launch, but maybe the process is dead (and needs to be relaunched).
>
> Still thinking about this issue.

Hello Eric,

I went with this patch

http://github.com/jmettraux/ruote/commit/0880debd868628138bc5497c45637851ce92157f

it doesn't attempt to relaunch if the wfid is less than 1 second old or points to a running process.
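The decision boils down to something like this sketch (illustrative names, assuming a stored record holding the wfid and the time it was written; this is not the patch's actual code):

```ruby
# Sketch of the relaunch guard: do not relaunch when the stored wfid is
# less than a second old (another engine may have just launched it) or
# when it points to a live process. All names here are illustrative.
GRACE = 1.0  # seconds

def relaunch?(record, now, live_wfids)
  return true  if record.nil?                         # never launched
  return false if now - record[:put_at] < GRACE       # too fresh to judge
  return false if live_wfids.include?(record[:wfid])  # still running
  true                                                # stale and dead: relaunch
end

now   = Time.now.to_f
fresh = { :wfid => 'w0', :put_at => now - 0.2 }
stale = { :wfid => 'w0', :put_at => now - 5.0 }

puts relaunch?(nil, now, [])        # => true   (nothing stored yet)
puts relaunch?(fresh, now, [])      # => false  (less than a second old)
puts relaunch?(stale, now, ['w0'])  # => false  (process is alive)
puts relaunch?(stale, now, [])      # => true   (stale and dead)
```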

Shout if you feel it's wrong.


Cheers,

Eric Platon

Oct 4, 2010, 9:47:18 PM
to ruote
Hello John,

I guess the latest solution will work in many cases, but it also looks
like a landmine that may be hard to detect later when debugging…

The problem pertains to several engines using the same storage---and
perhaps to the wfid generation scheme (?). There are two situations
then:
1) The engines work on the same wfid (meaning they need to share data)
=> It does not make sense (imho). The worker abstraction would be used
on a single engine to that end. It may move the problem to workers…
2) The engines work on different wfids => They can use the same
storage, but for different records. They may share process definitions
in the storage, but they do not share process instances owing to their
independence.

That may lead to restructuring the storage or, just a guess, labeling
process instances with an engine id so as to filter them.

You referred to the case where the process dies. I am not sure I get
it properly, but a dead process still leaves instance information in
the storage. Assuming there is an engine id in that information, can't
we do something that way? One problem with the current commit
(@0880deb) is that the above requires extending the cloche#get
function to include filters. Is that a reasonable way to proceed? I do
not know document-based data layers well...

Eric



John Mettraux

Oct 4, 2010, 10:14:03 PM
to openwfe...@googlegroups.com

On Mon, Oct 04, 2010 at 06:47:18PM -0700, Eric Platon wrote:
>
> I guess the latest solution will work in many cases, but it also looks
> like a mine that can be hard to detect in the future when debugging…
>
> The problem pertains to several engines using the same storage---and
> perhaps to the wfid generation scheme (?).

Hello Eric,

thanks for taking the time to reflect on those issues, it's welcome. (I've seen that you've worked with distributed systems a lot).

First let me try to fix some vocabulary issues. They are induced by the change from ruote pre-2.1.x to ruote 2.1.x.

A ruote engine is now no more than a dashboard for a ruote system. A ruote system is a set of 1 storage and 1+ workers.

The workers fetch msgs and schedules from the storage; they execute the msgs immediately, and the schedules when their time has come. Right before execution comes a second step: reservation. So the bottleneck of the system is the storage, whose correct implementation is required to avoid collisions.
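The reservation step can be pictured with this toy (a Mutex plays the part of the storage's atomicity guarantee; nothing here is ruote code): two workers see the same msg, but only the one whose delete succeeds executes it.

```ruby
require 'thread'

# Toy picture of the reservation step: both workers may fetch the same
# msg, but reservation is an atomic delete, so exactly one worker wins.
msgs = { 'msg-0' => { 'action' => 'launch' } }
lock = Mutex.new
executed_by = []

workers = (0..1).map do |i|
  Thread.new do
    if msgs.key?('msg-0')                              # fetch: both can see it
      won = lock.synchronize { msgs.delete('msg-0') }  # reserve: atomic delete
      executed_by << i if won                          # execute only if reserved
    end
  end
end
workers.each(&:join)

puts executed_by.size  # => 1, the msg runs exactly once
```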


> There are two situations then:
>
> 1) The engines work on the same wfid (meaning they need to share data)
> => It does not make sense (imho). The worker abstraction would be used
> on a single engine to that end. It may move the problem to workers…
>
> 2) The engines work on different wfids => They can use the same
> storage, but for different records. They may share process definitions
> in the storage, but they do not share process instances owing to their
> independence.

The wfid generator is implemented so that no two engines (dashboards) can draw the same wfid; this is backed by the storage.

The engines (dashboards) use the same storage. If two engines use different storages, they are part of two different ruote systems.


> That may lead to restructuring the storage or, just a guess, labeling
> process instances with an engine id so as to filter them.

Unfortunately, "engine id" should be renamed to "ruote system id" to be more accurate.

I haven't changed it, since it's seldom used. Most people go with one engine (one ruote system), but with some googling you'll find me suggesting multi-system deployments quite a few times.

On the other side of the spectrum, you'll find people running 1 ruote system for 1 process instance (it's an interesting idea, after all, that's how we use classical interpreters).


> You referred to the case where the process dies. I am not sure to get
> it properly, but a dead process still leaves instance information in
> the storage. Assuming there is an engine id in that information, can't
> we do something that way? One problem considering the current commit
> (@0880deb) is that the above requires extending the cloche#get
> function to include filters. Is that a way to engage? I do not know
> well document-based data layers...

I'm now not sure if your question still applies after my clarification.

If a 'single' process is stuck in an error or stuck because a participant is not responding, relaunching it will probably end up in a new stuck process; that's why I consider that if engine.process(wfid) returns something (a process exists), I should not re-attempt a launch.

The administrator of the system has to detect the issue (engine.errors) and find a solution for it (IMHO).


I hope it's not too confusing, cheers,
