Rogue workflow


eric smith

Sep 6, 2011, 5:49:56 PM
to ruote
Hello all,

I ran into an interesting situation with a workflow. Our project
manager built a workflow and was trying to do something that was
'legal' in ruote, but it ended up creating an endless rewind condition.
The net result was that the rewind ran for about 6 hours, creating 1.5
million audit entries.

Obviously this was not his intent. Besides telling him not to do that
again, it brought me back to my old instrumentation questions. When we
see ruote break, it is usually for one of the following reasons.

1.) Somebody built a bad workflow.
2.) A participant died in an unexpected way.
3.) A participant tried to do something that took a long time.
4.) Someone, or something killed a worker while it was working.
5.) We don't have enough workers running.

We currently use newrelic to let us peek into what the workers are
doing, but that does not give us enough info.

It is pretty easy for us to build a watchdog to govern the number of
history items that are created and shut things down if something goes
crazy, but I was wondering if you had a better way.
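
For reference, the watchdog we have in mind is roughly this (a sketch;
audit_count is our own audit-store lookup, not a ruote API):

---8<---
MAX_AUDITS = 50_000 # arbitrary ceiling before we pull the plug

engine.processes.each do |ps|
  # cancel any process whose audit trail explodes
  engine.cancel_process(ps.wfid) if audit_count(ps.wfid) > MAX_AUDITS
end
--->8---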

Also I was looking back at an old thread on fault tolerance and was
wondering if you have given any thought to this:

http://groups.google.com/group/openwferu-users/browse_thread/thread/c51b94fb8bb685da/3750af5580163949?lnk=gst&q=best+practice#3750af5580163949

Specifically, letting workers 'talk' to the engine.

Thanks
Eric Smith

John Mettraux

Sep 6, 2011, 8:13:43 PM
to openwfe...@googlegroups.com
Hello Eric,

On Tue, Sep 06, 2011 at 02:49:56PM -0700, eric smith wrote:
>
> I ran into an interesting situation with a workflow. Our project
> manager built a workflow and was trying to do something that was
> 'legal' in ruote, but it ended up creating an endless rewind condition.
> The net result was that the rewind ran for about 6 hours, creating 1.5
> million audit entries.

Ouch.

> Obviously this was not his intent. Besides telling him not to do that
> again, it brought me back to my old instrumentation questions. When we
> see ruote break, it is usually for one of the following reasons.

Let me still say how I think each case should be handled. I understand the quick solutions I mention are not applicable in all cases.

> 1.) Somebody built a bad workflow.

It should fail as soon as possible with an error logged.

> 2.) A participant died in an unexpected way.

An error should be logged.

> 3.) A participant tried to do something that took a long time.

Timeouts could help.
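
In a process definition, that could look like this (a quick sketch;
'slow_service' is just a placeholder participant):

---8<---
Ruote.process_definition do
  # give the participant two days, then redo it
  # (:on_timeout => 'error' would log an error instead)
  participant 'slow_service', :timeout => '2d', :on_timeout => 'redo'
end
--->8---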

> 4.) Someone, or something killed a worker while it was working.

If it results in a workflow error, then the workflow can be replayed at the error or the faulty branch can get re-applied.
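
Something like (sketch):

---8<---
err = engine.errors.first   # inspect the process errors
engine.replay_at_error(err) # replay at the error
# or re-apply the branch, given its flow expression id:
# engine.re_apply(fei)
--->8---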

> 5.) We don't have enough workers running.

The "engine" is visibly slow.

> We currently use newrelic to let us peek into what the workers are
> doing, but that does not give us enough info.
>
> It is pretty easy for us to build a watchdog to govern the number of
> history items that are created and shut things down if something goes
> crazy, but I was wondering if you had a better way.
>
> Also I was looking back at an old thread on fault tolerance and was
> wondering if you have given any thought to this:
>
> http://groups.google.com/group/openwferu-users/browse_thread/thread/c51b94fb8bb685da/3750af5580163949?lnk=gst&q=best+practice#3750af5580163949
>
> Specifically, letting workers 'talk' to the engine.

Since the engine is the sum of workers, let's translate that to "the workers somehow write in the storage some info about their existence and their activity".

I went back to this previous conversation. Here is what I extracted from one of your posts:

| I think this type of problem will continue to cause issues around
| fault tolerance and instrumentation. You should be able to ask the
| engine how many workers are running, how many are consuming. You
| should be able to pause or stop the workers.

About fault tolerance, I can only recommend manual or "on_error" error handling, ie letting your administrators peek at the error list frequently enough.
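
A quick sketch of the "on_error" flavour ('administrator' and
'flaky_service' being placeholder participants):

---8<---
Ruote.process_definition do
  # route any error in the sequence to the 'administrator' participant
  sequence :on_error => 'administrator' do
    participant 'flaky_service'
  end
end
--->8---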

Quick general reminder (and teaser for ruote 2.2.1):

Every ruote service that responds to the #on_msg(msg) method will see that method get called for each message the worker it lives with successfully processes.

---8<---
class ErrorNotifier
  def initialize(context, opts={})
    @new_relic = NewRelic.new(...)
  end

  def on_msg(msg)
    return unless msg['action'] == 'error_intercepted'
    @new_relic.emit(msg)
  end
end
--->8---
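
Wiring such a service in could look like this (a sketch; double-check
add_service's exact signature for your ruote version):

---8<---
engine = Ruote::Engine.new(Ruote::Worker.new(storage))

# the worker will now hand each successfully processed msg
# to the notifier's #on_msg
engine.add_service('error_notifier', ErrorNotifier)
--->8---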


| You should be able to ask the engine how many workers are running,
| how many are consuming.

So how about a "document" shared by all workers where they list:

- hostname, pid
- uptime
- msgs processed during last week/day/hour/minute
- timestamp

(what am I missing ?)

With an Engine#status method to query that document ?


| You should be able to pause or stop the workers.

engine.pause_workers, engine.resume_workers and engine.stop_workers ?

Do you need to pause one specific worker or a specific set of workers ?


Thanks for the reminder, the reporting feature is easy to add, but I had forgotten it. I was (am still) stuck on the "remotely pause/resume/stop workers" idea.

--
John Mettraux - http://lambda.io/processi

Eric Smith

Sep 8, 2011, 12:22:28 PM
to ruote

> engine.pause_workers, engine.resume_workers and engine.stop_workers ?
Yes, that would be great!

Combined with :

>So how about a "document" shared by all workers where they list:

>- hostname, pid
>- uptime
>- msgs processed during last week/day/hour/minute
>- timestamp

You could add a 'status' field to the document to know whether a
worker is paused, stopped or running.

Would you want to stop all workers, or each worker? (For our use case,
stop all is sufficient.)

It would be nice to know how long the workitem was waiting around
for a worker to get to it. That might be as meaningful as the number
of processes.

Thanks
Eric Smith

John Mettraux

Sep 19, 2011, 8:36:32 AM
to openwfe...@googlegroups.com

On Thu, Sep 08, 2011 at 09:22:28AM -0700, Eric Smith wrote:
>
> > engine.pause_workers, engine.resume_workers and engine.stop_workers ?
> Yes, that would be great!
>
> Combined with :
>
> >So how about a "document" shared by all workers where they list:
>
> >- hostname, pid
> >- uptime
> >- msgs processed during last week/day/hour/minute
> >- timestamp
>
> You could add a 'status' field to the document to know whether a
> worker is paused, stopped or running.
>
> Would you want to stop all workers, or each worker? (For our use case,
> stop all is sufficient.)
>
> It would be nice to know how long the workitem was waiting around
> for a worker to get to it. That might be as meaningful as the number
> of processes.

Hello,

I've added a Dashboard (Engine) #worker_info method. It returns information about the workers.

I'm probably going to add some more information to that.

For now it looks like:

---8<---
{"10.0.1.2/34710"=>
  {"pid"=>34710,
   "processed_last_minute"=>1,
   "class"=>"Ruote::Worker",
   "put_at"=>"2011-09-19 12:32:33.881352 UTC",
   "system"=>
    "Darwin sanma.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386",
   "processed_last_hour"=>1,
   "wait_time_last_hour"=>0.004534,
   "ip"=>"10.0.1.2",
   "uptime"=>0.008638,
   "hostname"=>"sanma.local",
   "wait_time_last_minute"=>0.004534}}
--->8---
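
Querying it from code is then straightforward (a quick sketch):

---8<---
engine.worker_info.each do |key, info|
  puts "#{key}: #{info['processed_last_hour']} msg(s) processed last hour"
end
--->8---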

I'm still thinking about how to tell the workers to stop without having them poll too much and how to do it so that it works with the different storage implementations.

Please tell me if there is something that needs to be revised or if something got forgotten.


Thanks in advance,

Eric Smith

Sep 19, 2011, 5:15:09 PM
to ruote
John, this looks like what we discussed. I assume that pid + ip
identifies the worker.

Thanks very much.
Eric

John Mettraux

Oct 2, 2011, 7:10:38 AM
to openwfe...@googlegroups.com
Hello Eric,

I'm still working on {pause|resume|stop}_workers, it's currently in a
local branch, looking good, I just have trouble finding time to work
on it, but I hope by the end of the week.

I was thinking about the worker_info: would it be OK if I wiped info
about workers that haven't replied in the last hour (or last 24 hours) ?
I don't want this list to become 99% dead workers.

What do you think ?

John

eric smith

Oct 4, 2011, 6:49:56 PM
to ruote
Hello John,
Sorry for not getting back to you sooner, and yes I think a 24 hour
window will work. I can't think of a case where a participant would
take 24 hours to run.
I was also thinking it would be nice to have the memory size of the
worker in worker_info, but now I am probably pushing my luck.

Thanks
Eric

John Mettraux

Oct 4, 2011, 6:59:59 PM
to openwfe...@googlegroups.com

On Tue, Oct 04, 2011 at 03:49:56PM -0700, eric smith wrote:
>
> Sorry for not getting back to you sooner, and yes I think a 24 hour
> window will work. I can't think of a case where a participant would
> take 24 hours to run.
> I was also thinking it would be nice to have the memory size of the
> worker in worker_info, but now I am probably pushing my luck.

Hello Eric,

thanks for the feedback, no worries for the delay.

The memory size is an excellent idea, adding it to my todo list.


Cheers,

Mario Camou

Oct 4, 2011, 7:15:37 PM
to openwfe...@googlegroups.com
I would make it configurable, say in config.ru (perhaps per participant type? That would probably make things too complex). I can think of at least one case where you might want more than 24 hours: interactive participants. So you call the participant, it notifies a user of some action that needs to take place, and doesn't reply until the user performs the action. In some use cases the user might take several days from the moment they receive the notification to the moment they perform the action.

-Mario.

--
I want to change the world but they won't give me the source code.



John Mettraux

Oct 4, 2011, 7:42:04 PM
to openwfe...@googlegroups.com

On Wed, Oct 05, 2011 at 01:15:37AM +0200, Mario Camou wrote:
>
> I would make it configurable, say in config.ru (perhaps per participant
> type? That would probably make things too complex). I can think of at least
> one case where you might want more than 24 hours: interactive participants.
> So you call the participant, it notifies a user of some action that needs to
> take place, and doesn't reply until the user performs the action. In some
> use cases the user might take several days from the moment they receive the
> notification to the moment they perform the action.

Hello Mario,

Engine/Dashboard#worker_info gives you information about the ruote workers that are alive (ip, pid, workload, memory, last time seen).

Are you suggesting ruote should collect such information from remote participants as well ?

Let me reformulate my question to Eric: we have multiple ruote workers (!= participants) and they each update worker_info every minute or so. At some point workers die and are not replaced. How long should we keep the worker_info about those dead workers: 1 month, 1 day, 5 minutes ?

24h seems OK.

Thanks for clarifying your idea.

Kind regards,

John

eric smith

Oct 5, 2011, 9:51:30 AM
to ruote
John,
With the clarification I still think 24h is ok. I could make an
argument for 72h: the agent dies over the course of a weekend and you
can't collect the data for some reason, but this argument feels pretty
weak. Whatever the limit, there could be a reason to extend it. We
intend to use the worker_info like a heartbeat or a ping. I think the
time frame should be tied to the longest period that a non-human
participant could take to complete before the worker could beat again.
If a worker does not beat for 24 hours, in my book it's dead.

I assume that "put_at"=>"2011-09-19 12:32:33.881352 UTC" is the last
time the worker checked in?
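
To be concrete, the kind of check we would run is roughly this (a
sketch; alert is our own notification hook, and I'm assuming put_at is
indeed the last check-in):

---8<---
require 'time'

STALE = 24 * 3600 # seconds without a heartbeat before we call a worker dead

engine.worker_info.each do |key, info|
  age = Time.now.utc - Time.parse(info['put_at'])
  alert("worker #{key} silent for #{age.to_i}s") if age > STALE
end
--->8---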

Thanks
Eric Smith

Mario Camou

Oct 5, 2011, 9:38:48 PM
to openwfe...@googlegroups.com

I see. I had it completely wrong, confusing workers and participants. Never mind, nothing to see here. Move along...

John Mettraux

Dec 7, 2011, 6:19:48 AM
to openwfe...@googlegroups.com
2011/10/2 John Mettraux <jmet...@gmail.com>:

>
> I'm still working on {pause|resume|stop}_workers, it's currently in a
> local branch, looking good, I just have trouble finding time to work
> on it, but I hope by the end of the week.

Hello Eric,

it seems like it took me two months to reach the end of the week.

It's merged into the main branch:

https://github.com/jmettraux/ruote/commit/3e19e8f47dda2d58f4a2ac73436f418579af3a52

There are 3 states: "running" (default), "paused" and "stopped".

By default it's disabled; you have to set "worker_state_enabled" to
true when initializing the storage, like in:

worker = Ruote::Worker.new(
  Ruote::HashStorage.new('worker_state_enabled' => true))

to make it work.
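
Once enabled, driving the workers from any client should look roughly
like this (from memory; the commit above is authoritative for the
exact method names):

---8<---
engine.worker_state             # => "running"
engine.worker_state = 'paused'  # workers finish their current msg, then idle
engine.worker_state = 'running' # resume
engine.worker_state = 'stopped' # workers shut down for good
--->8---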


I apologize for the delay. Feedback is welcome,

eric smith

Dec 8, 2011, 9:31:42 AM
to ruote
Thanks John,
I will pull it down and test it out.

E.

