On Tue, Sep 06, 2011 at 02:49:56PM -0700, eric smith wrote:
>
> I ran into an interesting situation with a workflow. Our project
> manager built a workflow and was trying to do something that was
> 'legal' in ruote but it ended up creating an endless rewind condition.
> The net result was the rewind ran for about 6 hours creating 1.5
> million audit entries.
Ouch.
> Obviously this was not his intent, besides telling him not to do that
> again it brought me back to my old instrumentation questions. When we
> see ruote break it is usually one of the following things.
Let me state how I think each case should be handled. I understand the quick solutions I mention are not applicable in all cases.
> 1.) Somebody built a bad workflow.
It should fail as soon as possible with an error logged.
> 2.) A participant died in an unexpected way.
An error should be logged.
> 3.) A participant tried to do something that took a long time.
Timeouts could help.
> 4.) Someone, or something killed a worker while it was working.
If it results in a workflow error, then the workflow can be replayed at the error, or the faulty branch can get re-applied.
> 5.) We don't have enough workers running.
The "engine" is visibly slow.
> We currently use newrelic to let us peek into what the workers are
> doing but that does not give us enough info.
>
> It is pretty easy for us to build a watchdog to govern the number of
> history items that are created and shut them down if someone goes
> crazy, but I was wondering if you had a better way.
>
> Also I was looking back at an old thread on fault tolerance and was
> wondering if you have given any thought to this:
>
> http://groups.google.com/group/openwferu-users/browse_thread/thread/c51b94fb8bb685da/3750af5580163949?lnk=gst&q=best+practice#3750af5580163949
>
> Specifically, letting workers 'talk' to the engine.
Since the engine is the sum of workers, let's translate that to "the workers somehow write in the storage some info about their existence and their activity".
I went back to this previous conversation. Here is what I extracted from one of your posts:
| I think this type of problem will continue to cause issues around
| fault tolerance and instrumentation. You should be able to ask the
| engine how many workers are running, how many are consuming. You
| should be able to pause or stop the workers.
About fault tolerance, I can only recommend manual or "on_error" error handling, ie letting your administrators peek at the error list frequently enough.
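To illustrate (participant names are made up), here is how :timeout, :on_timeout and :on_error can combine in a process definition to cover cases 2 and 3 above:
---8<---
require 'ruote'

pdef = Ruote.process_definition 'review' do

  participant 'reviewer',
    :timeout => '2d',             # case 3: don't wait forever
    :on_timeout => 'error',       # turn the timeout into a process error
    :on_error => 'notify_admin'   # case 2: hand errors to a subprocess

  define 'notify_admin' do
    participant 'admin'
  end
end
--->8---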
Quick general reminder (and teaser for ruote 2.2.1):
Every ruote service that responds to the #on_msg(msg) method will see that method get called for each message the worker it lives with successfully processes.
---8<---
class ErrorNotifier

  def initialize(context, opts={})
    @new_relic = NewRelic.new(...)
  end

  # gets called for each msg the worker successfully processes
  def on_msg(msg)

    return unless msg['action'] == 'error_intercepted'

    @new_relic.emit(msg)
  end
end
--->8---
| You should be able to ask the engine how many workers are running,
| how many are consuming.
So how about a "document" shared by all workers where they list:
- hostname, pid
- uptime
- msgs processed during last week/day/hour/minute
- timestamp
(what am I missing ?)
With an Engine#status method to query that document ?
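A rough sketch of what the worker side could look like (the 'workers' document layout and the helper name are tentative, #get/#put being the usual ruote storage methods):
---8<---
require 'socket'

# each worker periodically merges its own entry into a shared
# 'workers' document

def put_worker_info(storage, started_at, processed_last_minute)

  doc =
    storage.get('variables', 'workers') ||
    { 'type' => 'variables', '_id' => 'workers', 'workers' => {} }

  doc['workers']["#{Socket.gethostname}/#{Process.pid}"] = {
    'hostname' => Socket.gethostname,
    'pid' => Process.pid,
    'uptime' => Time.now - started_at,
    'processed_last_minute' => processed_last_minute,
    'put_at' => Time.now.utc.to_s }

  # storage#put returns nil on success and the current doc on a
  # rev clash, in which case the worker simply retries on its next beat
  storage.put(doc)
end
--->8---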
| You should be able to pause or stop the workers.
engine.pause_workers, engine.resume_workers and engine.stop_workers ?
Do you need to pause one specific worker or a specific set of workers ?
Thanks for the reminder, the reporting feature is easy to add, but I had forgotten it. I was (am still) stuck on the "remotely pause/resume/stop workers" idea.
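For the record, the naive approach would have the worker's run loop poll a shared state document (a sketch, the 'worker_state' document and the process_one_msg method are hypothetical); the trouble is picking a check frequency that doesn't hammer the storage:
---8<---
loop do

  doc = @storage.get('variables', 'worker_state')
  state = doc ? doc['state'] : 'running'

  case state
  when 'stopped' then break    # exit the run loop
  when 'paused' then sleep(1)  # idle until resumed
  else process_one_msg         # normal operation
  end
end
--->8---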
--
John Mettraux - http://lambda.io/processi
Hello,
I've added a Dashboard (Engine) #worker_info method. It returns information about the workers.
I'm probably going to add some more information to that.
For now it looks like:
---8<---
{"10.0.1.2/34710"=>
{"pid"=>34710,
"processed_last_minute"=>1,
"class"=>"Ruote::Worker",
"put_at"=>"2011-09-19 12:32:33.881352 UTC",
"system"=>
"Darwin sanma.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386",
"processed_last_hour"=>1,
"wait_time_last_hour"=>0.004534,
"ip"=>"10.0.1.2",
"uptime"=>0.008638,
"hostname"=>"sanma.local",
"wait_time_last_minute"=>0.004534}}
--->8---
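Querying it from the engine side looks like:
---8<---
engine = Ruote::Engine.new(Ruote::Worker.new(storage))

p engine.worker_info
  # => the "ip/pid" keyed hash shown above
--->8---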
I'm still thinking about how to tell the workers to stop without having them poll too much and how to do it so that it works with the different storage implementations.
Please tell me if there is something that needs to be revised or if something got forgotten.
Thanks in advance,
I'm still working on {pause|resume|stop}_workers. It's currently in a
local branch and looking good; I just have trouble finding time to work
on it, but I hope to get it done by the end of the week.
I was thinking about the worker_info: would it be OK if I wiped info
about workers that haven't replied in the last hour (or last 24 hours)
? I don't want this list to become 99% dead workers.
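The pruning itself would be trivial, something like this (using the 'put_at' timestamp shown above):
---8<---
require 'time'

# drop the entries for workers not seen in the last 24 hours
worker_info.delete_if do |key, info|
  Time.now.utc - Time.parse(info['put_at']) > 24 * 3600
end
--->8---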
What do you think ?
John
Hello Eric,
thanks for the feedback, no worries for the delay.
The memory size is an excellent idea, adding it to my todo list.
Cheers,
Hello Mario,
Engine/Dashboard#worker_info gives you information about the ruote workers that are alive (ip, pid, workload, memory, last time seen).
Are you suggesting ruote should collect such information from remote participants as well ?
Let me reformulate my question to Eric: we have multiple ruote workers (!= participants), each updating worker_info roughly every minute. At some point workers die and are not replaced. How long should we keep the worker_info about dead workers: 1 month, 1 day, 5 minutes ?
24h seems OK.
Thanks for clarifying your idea.
Kind regards,
John
Hello Eric,
it seems like it took me two months to reach the end of the week.
It's merged into the main branch:
https://github.com/jmettraux/ruote/commit/3e19e8f47dda2d58f4a2ac73436f418579af3a52
There are 3 states: "running" (default), "paused" and "stopped".
By default it's disabled; you have to set "worker_state_enabled" to
true when initializing the storage, like in:

  worker = Ruote::Worker.new(
    Ruote::HashStorage.new('worker_state_enabled' => true))

to make it work.
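Once enabled, the state gets flipped from the engine/dashboard side; roughly (see the commit above for the exact method names):
---8<---
engine.worker_state              # => 'running'

engine.worker_state = 'paused'   # workers stop taking msgs
engine.worker_state = 'running'  # back to work
engine.worker_state = 'stopped'  # workers exit their run loop
--->8---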
I apologize for the delay. Feedback is welcome,