'stuck' processes with empty state & expression in 'failing' state


Matthew York

Feb 6, 2014, 3:56:21 PM2/6/14
to openwfe...@googlegroups.com

Hello,


We've been using the latest version of ruote with ruote-mon and 2 workers for about 6 months now.

Over time, as our process definitions have become more complex and longer-running, they have also become less reliable.


Many processes are getting 'stuck': they never enter the error state and also fail to respond to cancel.

I've been using ruote-kit to monitor and clean up these processes, which usually works.

In the case where a process is 'stuck' and I attempt to kill it, the process changes to the 'dying' state and never gets removed from the list.

This seems to happen around calls to subprocesses where I attempt to use the 'pass' expression for on_error and on_timeout:


 cursor :timeout => '${v:timeout}', :on_timeout => :pass, :tag => 'wait_for_fqdn_discovery' do
   get_machine_fqdn
   sequence :unless => '${f:machine_fqdn}' do
     log 'waiting 60s' => '${f:machine.machine_id}'
     wait '60s'
     rewind
   end
 end

 refresh_state :on_error => 'pass'


Am I doing this incorrectly?

Matthew York

Feb 6, 2014, 4:35:13 PM2/6/14
to openwfe...@googlegroups.com
After some more testing: we tried using just a single worker, which seems to be way more stable.

John Mettraux

Feb 6, 2014, 4:59:30 PM2/6/14
to openwfe...@googlegroups.com

> On Thursday, February 6, 2014 12:56:21 PM UTC-8, Matthew York wrote:
> >
> > We've been using the latest version of ruote w/ ruote-mon using 2 workers
> > for about 6 months now.
> >
> > Over time as our process definitions have become more complex and longer
> > running they have also become less reliable.

Hello Matthew,

Combined with the "we just tried using a single worker which seems to be way
more stable", I'd say there's some tiny thing wrong in one expression that
fails sometimes, and those "sometimes" do accumulate.

Or simply a problem with ruote-mon.

> > Many processes are getting 'stuck' - where they never enter the error
> > state and also fail to respond to cancel.
> >
> > I’ve been Using Ruote-kit to monitor and clean up these processes which
> > usually works.
> >
> > In the case where a process is 'stuck' and I attempt to kill it, the
> > process changes to the 'dying' state, and never gets removed from the list.

It'd be interesting to know how the dying state propagates in the stuck
process expression trees.

> > This seems to happen around calls to subprocesses where I attempt to use
> > the ‘pass’ expression for on_error and on_timeout:
>
> > cursor :timeout => '${v:timeout}', :on_timeout => :pass, :tag =>
> > 'wait_for_fqdn_discovery' do
> >
> > get_machine_fqdn
> >
> > sequence :unless => '${f:machine_fqdn}' do
> >
> > log 'waiting 60s' => '${f:machine.machine_id}'
> >
> > wait '60s'
> >
> > rewind
> >
> > end
> >
> > end
> >
> > refresh_state :on_error => 'pass'
>
> > Am I doing this incorrectly?

It looks OK. Maybe http://ruote.io/common_attributes.html#on_error_composing
could help (or bring more "stuckage").
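
If I read that page right, the composing form lets you route different errors to
different handlers. From memory (so double-check the exact syntax on that page),
it would look something like this around your cursor; the error patterns and
handler names below are made up for the sake of illustration:

```ruby
# sketch from memory, not verbatim from the docs; the error patterns and
# participant names here are hypothetical
cursor :tag => 'wait_for_fqdn_discovery',
       :timeout => '${v:timeout}',
       :on_timeout => :pass,
       :on_error => [
         [ /connection refused/, 'retry_fqdn_discovery' ],  # matching errors go to a (hypothetical) handler
         [ nil, 'report_fqdn_error' ]                       # nil acts as the catch-all, another hypothetical participant
       ] do
  get_machine_fqdn
  # ... rest of the cursor body ...
end
```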

On Thu, Feb 06, 2014 at 01:35:13PM -0800, Matthew York wrote:
> After some more testing - we tried just using a single worker which seems
> to be way more stable.

I could let you go with running one worker and hope for the best. Or I could
press for more information and locate and fix the issue. Or help you locate
and fix the issue, in ruote and/or in ruote-mon.

Kind regards,

John

Matthew York

Feb 6, 2014, 5:39:07 PM2/6/14
to openwfe...@googlegroups.com
Hello John - Thanks for the response

I liked the idea of having 2 workers using the same storage, but the main goal was to provide some redundancy.
The processes queuing up work for ruote are behind ActiveMQ, so I suppose there is still redundancy even if I end up splitting the 2 workers to use separate storage.

Rather than troubleshoot ruote-mon, I was thinking about trying the Redis provider to see if it exhibits the same behavior.
For now I'll move ahead with a single worker and attempt to find whatever other issues I may be having with the process definitions themselves.

Later on, if I want to continue troubleshooting, is using http://ruote.rubyforge.org/noisy.html the only way to go? I recall having issues trying to get it to work.

Thanks,
--Matt

John Mettraux

Feb 6, 2014, 7:01:17 PM2/6/14
to openwfe...@googlegroups.com

On Thu, Feb 06, 2014 at 02:39:07PM -0800, Matthew York wrote:
>
> (...)
>
> goal was to provide some redundancy.
> The processes queuing up work for ruote are behind activeMQ, so I suppose
> there is still redundancy even if I end up splitting the 2 workers to use
> separate storage.
>
> Rather than troubleshoot ruote-mon I was thinking about trying the redis
> provider to see if it exhibits the same behavior.
> For now I'll move ahead with a single worker & attempt to find whatever
> other issues I may be having with the process definitions themselves.
>
> Later on if I want to continue to troubleshoot is
> using http://ruote.rubyforge.org/noisy.html the only way to go? I recall
> having issues trying to get this to work.

Hello,

I can't remember you mentioning having trouble with getting noisy to work ;-)

I use(d) noisy a lot when developing expressions and when debugging process
definitions. Along with test/rspec it's great for seeing what a worker is doing.
When there are two workers working on the same process instance, the noisy
output gets split, making it difficult to interpret without some shuffling.
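
Turning it on boils down to flipping noisy on the dashboard. The storage setup
below is from memory, so check the ruote-mon README for the exact class and
connection names:

```ruby
require 'rubygems'
require 'ruote'
require 'ruote-mon'
  # the storage class and connection call below are from memory,
  # verify them against the ruote-mon README

dashboard =
  Ruote::Dashboard.new(
    Ruote::Worker.new(
      Ruote::Mon::Storage.new(
        Mongo::Connection.new['ruote_work'])))

dashboard.noisy = true
  # from now on, every msg the worker processes is dumped (colourized) to $stdout
```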

Best regards,

John

Matthew York

Feb 28, 2014, 5:41:59 PM2/28/14
to openwfe...@googlegroups.com
Hi John, 

So some strangeness today:

I have a bit of process def that has this:

    sequence :unless => '$f:machine.remote_id' do

I've encountered a scenario where this comparison fails: from my logs I see a dump of f:machine that does have 'remote_id' populated just before this part of the process runs.
Somehow the sequence is still executed.

Note that this particular process is nested under a concurrent_iterator, and in the other branches 29 other machines (properly) skipped this sequence, but one did not.

I've looked over some of the older posts about comparisons, and I don't think I'm doing anything too strange here.

Do you think this could be an issue with ruote-mon? (which I'm still using, with 1 worker)
I haven't tried to replicate this issue using another storage.

I'm just now starting to attempt to replicate this issue with a simple test case, but something tells me this isn't simple.

--Thanks
Matt York

John Mettraux

Feb 28, 2014, 5:52:08 PM2/28/14
to openwfe...@googlegroups.com

On Fri, Feb 28, 2014 at 02:41:59PM -0800, Matthew York wrote:
>
> So some strangeness today:
>
> I have a bit of process def that has this:
>
> sequence :unless => '$f:machine.remote_id' do
>
> I've encountered a scenario where this comparison fails - from my logs I
> see a dump of f:machine that does have 'remote_id' populated just before
> this process is run.
> Somehow the sequence is still executed.
>
> Note that this particular process is nested under a concurrent_iterator -
> and in the other branches 29 other machines (properly) skipped this
> sequence, but one did not.
>
> I've looked over some of the older posts about comparisons, and I don't
> think i'm doing anything too strange here.
>
> Do you think this would be an issue with ruote-mon? ( which I'm still using
> with 1 worker )
> I haven't tried to replicate this issue using another storage.
>
> I'm just now starting to attempt to replicate this issue with a simple test
> case, but something tells me this isn't simple.

Hello Matt,

I'm sorry, from the information you are giving me, I cannot say much.

What is remote_id right before the sequence gets hold of it? Could it be ""
or "false"? The comparison/unless code isn't that smart; it takes a decision,
but not a random one.

You could add some debug output around line 368 of
lib/ruote/exp/flow_expression.rb, or in the apply? method of class Condition
in lib/ruote/exp/condition.rb, to see what's ending up there...
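
Something quick and dirty, for instance (assuming apply? is the class method
mentioned above; I'm not vouching for its exact arity in your version, hence
the splat):

```ruby
require 'ruote'

# temporary trace around Condition.apply?, purely for debugging;
# remove it once the culprit value has been spotted
module Ruote::Exp
  class Condition

    class << self

      alias_method :apply_without_trace?, :apply?

      def apply?(*args)
        apply_without_trace?(*args).tap do |result|
          $stderr.puts(
            "Condition.apply?(#{args.map(&:inspect).join(', ')}) => #{result.inspect}")
        end
      end
    end
  end
end
```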

I hope this will help.


Best regards,

John

Matthew York

Feb 28, 2014, 6:02:35 PM2/28/14
to openwfe...@googlegroups.com
Hi John

Thanks for the quick reply - I added some more detailed info here: https://gist.github.com/stackdump/217b2c87f62540fa4807

I didn't include the process that logs f:machine before the sequence, but you can see the log entry just after the sequence begins running.

The machine hash already contained a remote_id:
2014-02-28 19:20:26  Env: 5310bda1e1d14826e00000bc Thread: 22454140  - Participants::Log: {"CREATING MACHINE"=>{"availability_zone"=>"nova", "flags"=>{"migrate"=>true, "wipe"=>true}, "flavor_id"=>"18", "image_id"=>"de0bf0b8-8f16-4e0e-bc23-cfc27c52283c", "machine_id"=>"5310be96749f06d9ef000051", "name"=>"mlb14-goo-balancer4", "puppet_role"=>"role::sdod::playerconnect", "remote_id"=>"020235ef-9541-46b3-9f05-b1832daf440d", "security_groups"=>["server.balancer"], "services"=>{"playcore"=>{"balancer"=>{}}}, "state"=>"ECO_CREATED", "tenant_name"=>"GPAD_SD1", "user_data"=>"application=mlb14&environment=production pe_eco_environment=ote pe_eco_message_broker=eco-ote-messaging.eco.usw1.cld.scea.com", "_id"=>"5310e13de1d1483dd90000c0"}, "ref"=>"log"}

Thanks for the debugging tip - I'll keep investigating.

--Thanks
Matt York

John Mettraux

Feb 28, 2014, 6:29:16 PM2/28/14
to openwfe...@googlegroups.com

On Fri, Feb 28, 2014 at 03:02:35PM -0800, Matthew York wrote:
>
> Thanks for the quick reply - I added some more detailed info here:
> https://gist.github.com/stackdump/217b2c87f62540fa4807
>
> I didn't include the process that logs f:machine before the sequence, but
> you can see the log entry just after the sequence begins running.
>
> The machine hash already contained a remote_id:
>
> 2014-02-28 19:20:26 Env: 5310bda1e1d14826e00000bc Thread: 22454140 - Participants::Log: {"CREATING MACHINE"=>{"availability_zone"=>"nova", "flags"=>{"migrate"=>true, "wipe"=>true}, "flavor_id"=>"18", "image_id"=>"de0bf0b8-8f16-4e0e-bc23-cfc27c52283c", "machine_id"=>"5310be96749f06d9ef000051", "name"=>"mlb14-goo-balancer4", "puppet_role"=>"role::sdod::playerconnect", "remote_id"=>"020235ef-9541-46b3-9f05-b1832daf440d", "security_groups"=>["server.balancer"], "services"=>{"playcore"=>{"balancer"=>{}}}, "state"=>"ECO_CREATED", "tenant_name"=>"GPAD_SD1", "user_data"=>"application=mlb14&environment=production pe_eco_environment=ote pe_eco_message_broker=eco-ote-messaging.eco.usw1.cld.scea.com", "_id"=>"5310e13de1d1483dd90000c0"}, "ref"=>"log"}

Hello Matt,

I've updated the test/unit/ut_6_condition.rb with:

```ruby
def test_unless

  assert_apply :unless => 'true == false'
  assert_skip :unless => 'false == false'

  assert_skip :unless => 'true'
  assert_skip :unless => '20235ef'
  assert_skip :unless => '020235ef'
  assert_skip :unless => '020235ef-9541-46b3-9f05-b1832daf440d'
end
```

and it fails with "020235ef".

I have to fix that: https://github.com/jmettraux/ruote/issues/93

Stay tuned.

John

John Mettraux

Feb 28, 2014, 6:55:29 PM2/28/14
to openwfe...@googlegroups.com
Ouch, it's tougher than I thought.

The Ruby parser I use evaluates "020235ef" as *false*...

As a workaround, I'd suggest going with something like:

sequence :unless => "$f:x.y.z.remote_id is set"

My tests say that "020235ef is set" is considered true.
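
Applied to the field path from your logs, that would be roughly:

```ruby
# same definition as before, only the condition string changes
sequence :unless => '$f:machine.remote_id is set' do
  # ... creation / retry steps ...
end
```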

I'm considering closing the issue... Can't fix; I don't want to bother reporting
it upstream, RubyParser is a pain (story too long).

Please try with the workaround.


Sorry about that.

John

Matthew York

Feb 28, 2014, 6:57:35 PM2/28/14
to openwfe...@googlegroups.com
Hi John,

Thanks for the help - I'll try the workaround.

--Matt


