Salut Eric,
it was nice meeting you at the Ruby meetup.
> I am evaluating how Ruote deals with crashes at the workflow level.
> Resumption works, with an unexpected side-effect: The engine does
> resume a crashed process, but it also runs the workflow once more!
>
> I am probably doing something wrong. Here is a contrived example to
> show the problem.
I have the impression it's only a simple misunderstanding. How about this version of the quickstart :
The launch part has been modified like this :
---8<---
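require 'fileutils' # for FileUtils.rm below

WFID_FILE = 'ruote_quickstart_wfid.txt'
# reuse the wfid left behind by a previous (crashed) run, if any
wfid = File.read(WFID_FILE).strip rescue nil
# launch only when no previous wfid was found, then remember it
wfid ||= engine.launch(pdef)
File.open(WFID_FILE, 'wb') { |f| f.write(wfid) }
engine.wait_for(wfid)
  # blocks current thread until our process instance terminates
# the process is over, forget its wfid so that the next run launches again
FileUtils.rm(WFID_FILE)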
--->8---
I use variations of this pattern when I want to ensure that only 1 instance of a specific process definition runs at some point (usually it's a process that involves a cron expression: http://ruote.rubyforge.org/exp/cron.html ).
Usually ruote is used inside a web application or as a standalone worker, and it's supposed to never exit: processes keep running, and so on. Processes are not launched automatically when the webapp or the worker starts; they are launched (a bit) later on, triggered by an external event.
Sorry if the quickstart is misleading.
I'd suggest looking at ruote-kit
http://github.com/tosch/ruote-kit
for a web application (rack) that wraps ruote with a decent web interface. Maybe it's easier to get a feel for "ruote, the service" in that context.
So, it's not a bug, the two launch requests were honoured.
Cheers !
--
John Mettraux - http://jmettraux.wordpress.com
Hello Eric,
I've been (slowly) preparing something about that :
I hope to finish it by tomorrow.
It's true that the wfid file trick is not suited to environments where engines in different Ruby runtimes launch such "unique" processes, hence my launch_single() work.
Stay tuned,
Hello Eric,
thanks for your idea, I've refined the launch_single method into :
http://github.com/jmettraux/ruote/commit/b1d1046b60b4a11ef3b0cb4cc83b368870c34854
The tests look like :
http://github.com/jmettraux/ruote/blob/ruote2.1/test/functional/ft_46_launch_single.rb
Not super happy with the 'single[s]' appellation, but I didn't want to use 'singleton'... Well I could have, I already have 'instances'...
Thanks again,
Hello Eric,
yes, this was convoluted and the solution was one good night of sleep.
> About your second test case, is there any concrete reason to sleep for
> 0.4s ?
It's just to give some breathing time to the engine. With fast storages this is too much, but with a storage like ruote-couch, or if you're encoding mkvs in the background, it might be necessary. Since I run the tests with all the storages, I go for a safe value.
Speaking of things occurring at different paces, I think there might still be an issue in the current implementation : what happens if engine A just launched wfid0, and engine B stumbles on
return wfid if wfid && process(wfid) != nil
# process is already running
?
wfid0 is encountered, but process(wfid0) returns nil (for now).
I will probably expand that to something like
if wfid
  sleep 0.400
  return wfid if process(wfid)
end
or
2.times { |i| sleep(i * 0.350); return wfid if process(wfid) } if wfid
but well... the risk is still there.
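To make the second variant concrete, here is a minimal, self-contained sketch. FakeEngine and running_wfid are hypothetical names made up for the illustration; only the process(wfid) call comes from the snippet above. The fake engine pretends that a freshly launched process becomes visible only after a few lookups.

```ruby
# Hypothetical stand-in for an engine: #process returns nil until the
# launch has "reached" the storage (after `visible_after` lookups).
class FakeEngine
  def initialize(visible_after)
    @visible_after = visible_after
    @lookups = 0
  end

  def process(wfid)
    @lookups += 1
    @lookups > @visible_after ? { 'wfid' => wfid } : nil
  end
end

# Returns wfid if the process shows up within `attempts` checks,
# nil otherwise (meaning the caller may go on and launch).
def running_wfid(engine, wfid, attempts: 2, pause: 0.350)
  return nil unless wfid
  attempts.times do |i|
    sleep(i * pause) # no pause before the first check
    return wfid if engine.process(wfid)
  end
  nil
end
```

Waiting only narrows the window: a storage slower than attempts * pause would still let a second launch through.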
Best regards,
Hello Eric,
this would tell me that another engine did the launch, but maybe the process is dead (and needs to be relaunched).
Still thinking about this issue.
Thanks for the help,
Hello Eric,
I went with this patch
http://github.com/jmettraux/ruote/commit/0880debd868628138bc5497c45637851ce92157f
it doesn't attempt to relaunch if the wfid is less than 1 second old or pointing to a running process.
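Roughly, the rule reads like the sketch below. This is not the code of the commit, only an illustration of the decision: SingleRecord, relaunch? and the `running` callable (standing in for "engine.process(wfid) is not nil") are hypothetical names.

```ruby
# Hypothetical record kept for a 'single' process: its wfid and launch time.
SingleRecord = Struct.new(:wfid, :launched_at)

# Returns true when a (re)launch should be attempted.
def relaunch?(record, running, now: Time.now, grace: 1.0)
  return true if record.nil?                        # never launched so far
  return false if now - record.launched_at < grace  # too fresh to judge
  !running.call(record.wfid)                        # relaunch only if gone
end
```

The grace period covers the case where another engine has just launched and the process is not yet visible.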
Shout if you feel it's wrong.
Cheers,
Hello Eric,
thanks for taking the time to reflect on those issues, it's welcome. (I've seen that you've worked with distributed systems a lot).
First let me try to fix some vocabulary issues. They are induced by the change from ruote pre-2.1.x to ruote 2.1.x.
A ruote engine is now no more than a dashboard to a ruote system. A ruote system is a set with 1 storage and 1+ workers.
The workers fetch msgs and schedules from the storage and execute the msgs immediately, and the schedules if it's the time. The first step before execution is a second reservation step. So the bottleneck of the system is the storage, whose correct implementation is required to avoid collisions.
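As an illustration of why the reservation matters, here is a toy sketch (schedules left out). ToyStorage and ToyWorker are hypothetical, in-memory stand-ins, not ruote classes; a real storage has to make the reservation atomic across Ruby runtimes, not just across threads.

```ruby
# In-memory stand-in for a storage shared by several workers.
class ToyStorage
  def initialize
    @msgs = []
    @mutex = Mutex.new
  end

  def put_msg(msg)
    @mutex.synchronize { @msgs << msg }
  end

  # Snapshot of the pending msgs; several workers may see the same ones.
  def get_msgs
    @mutex.synchronize { @msgs.dup }
  end

  # Succeeds (returns true) for exactly one caller per msg.
  def reserve(msg)
    @mutex.synchronize { !@msgs.delete(msg).nil? }
  end
end

class ToyWorker
  attr_reader :executed

  def initialize(storage)
    @storage = storage
    @executed = []
  end

  # One pass: fetch, reserve, then "execute" only what was reserved.
  def step
    @storage.get_msgs.each do |msg|
      next unless @storage.reserve(msg) # another worker was faster
      @executed << msg
    end
  end
end
```

If reserve() could succeed for two workers, the same msg would be executed twice, which is the kind of collision the storage has to prevent.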
> There are two situations then:
>
> 1) The engines work on the same wfid (meaning they need to share data)
> => It does not make sense (imho). The worker abstraction would be used
> on a single engine to that end. It may move the problem to workers…
>
> 2) The engines work on different wfids => They can use the same
> storage, but for different records. They may share process definitions
> in the storage, but they do not share process instances owing to their
> independence.
The wfid generator is implemented so that no two engines (dashboards) can draw the same wfid; this is backed by the storage.
The engines (dashboards) use the same storage. If two engines use different storages, they are part of two different ruote systems.
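A toy sketch of the idea, not ruote's actual wfid algorithm: CounterStore, Dashboard and the wfid format are assumptions made up for the illustration. The point is only that uniqueness comes from the shared storage, not from the dashboards.

```ruby
# Hypothetical shared storage holding one atomic counter.
class CounterStore
  def initialize
    @counter = 0
    @mutex = Mutex.new
  end

  # Atomically increments the counter and returns the new value.
  def next_value
    @mutex.synchronize { @counter += 1 }
  end
end

# Hypothetical dashboard: it keeps no counter of its own.
class Dashboard
  def initialize(store)
    @store = store
  end

  def generate_wfid(now = Time.now)
    format('%s-%06d', now.strftime('%Y%m%d'), @store.next_value)
  end
end
```

Two dashboards built on the same CounterStore never draw the same wfid; two dashboards on different stores could, but then they belong to two different systems.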
> That may lead to restructuring the storage or, just a guess, labeling
> process instances with an engine id so as to filter them.
Unfortunately, "engine id" should be renamed to "ruote system id" to be more accurate.
I haven't changed it, since it's seldom used. Most of the people go with one engine (one ruote system), but with some googling you'll find me suggesting multi-system deployments quite a few times.
On the other side of the spectrum, you'll find people running 1 ruote system for 1 process instance (it's an interesting idea, after all, that's how we use classical interpreters).
> You referred to the case where the process dies. I am not sure to get
> it properly, but a dead process still leaves instance information in
> the storage. Assuming there is an engine id in that information, can't
> we do something that way? One problem considering the current commit
> (@0880deb) is that the above requires extending the cloche#get
> function to include filters. Is that a way to engage? I do not know
> well document-based data layers...
I'm now not sure if your question still applies after my clarification.
If a 'single' process is stuck in an error, or stuck because a participant is not responding, relaunching it will probably end up in a new stuck process. That's why I consider that if engine.process(wfid) returns something (a process exists), I should not re-attempt to launch.
The administrator of the system has to detect the issue (engine.errors) and find a solution for it (IMHO).
I hope it's not too confusing, cheers,