Workitem disappearing...


David Powell

Dec 15, 2011, 6:26:28 PM
to ruote
Greetings,

We are using ruote-kit with ruote-redis and seem to be hitting a bug.
We have a simple participant that touches a file, waits 20s, then
"rewinds". This works fine for a while, but then usually once a day
it stops. The "/_ruote" endpoint shows the "process" still exists,
but there is no entry for it in "workitems" nor "schedule", and
nothing is reported on the "error" page.

Has anyone else experienced a similar problem? Any suggestions on how
I should go about tracking down the issue?


Details of our gem versions:

gem 'ruote', '~> 2.2.0'
gem 'ruote-kit', :git => 'http://github.com/kennethkalmer/ruote-kit.git', :ref => '191817aa92715b'
gem 'redis', '2.2.2'
gem 'ruote-redis', :git => 'https://github.com/jmettraux/ruote-redis.git', :ref => 'cdb5712aa'


Cheers,

John Mettraux

Dec 15, 2011, 8:54:03 PM
to openwfe...@googlegroups.com

On Thu, Dec 15, 2011 at 03:26:28PM -0800, David Powell wrote:
>
> We are using ruote-kit with ruote-redis and seem to be hitting a bug.
> We have a simple participant that touches a file, waits 20s, then
> "rewinds". This works fine for a while, but then usually once a day
> it stops. The "/_ruote" endpoint shows the "process" still exists,
> but there is no entry for it in "workitems" nor "schedule", and
> nothing is reported on the "error" page.
>
> Has anyone else experienced a similar problem? Any suggestions on how
> I should go about tracking down the issue?

Hello David,

welcome to ruote's mailing list.

It'd be interesting to know exactly where it stops. It sounds like a "msg"
got lost.

Sometimes it happens with corner cases in participant implementation, like in

---8<---
class TouchParticipant
  include Ruote::LocalParticipant

  def consume(workitem)
    FileUtils.touch(workitem.fields['filename'])
    reply_to_engine(workitem)
  rescue
    puts "something went wrong"
  end
end
--->8---

If anything goes wrong in that participant, a message is printed to STDOUT
and reply_to_engine is never called.
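The failure mode can be sketched in plain Ruby, independent of ruote (the method names below are hypothetical, for illustration only): a rescue that merely logs makes the method return from the rescue clause, so the happy-path reply is never sent, whereas an ensure clause runs either way.

```ruby
# Hypothetical sketch: why a log-only rescue swallows the reply,
# and why moving the reply into an ensure clause fixes it.

def consume_with_rescue
  replied = false
  raise 'touch failed'   # simulate FileUtils.touch blowing up
  replied = true         # stand-in for reply_to_engine, never reached
rescue
  replied                # error swallowed; the reply was never sent
end

def consume_with_ensure
  replied = false
  begin
    raise 'touch failed'
  rescue
    # log the error here
  ensure
    replied = true       # stand-in for reply_to_engine; always runs
  end
  replied
end

puts consume_with_rescue  # false
puts consume_with_ensure  # true
```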

Maybe it stops around your participant implementation. Placing some logging
output may help.

It could also happen for some more mysterious reason (a defect in my
implementation of ruote-redis, for example), so determining where it really
stops would help.

A way of fixing that is to re-apply the expression that was supposed to
reply. In ruote-kit this can be done from the dashboard (see attached image).
It's only a repair trick; it requires admin intervention.


I'd be glad to help determine the cause.

Best regards,

--
John Mettraux - http://lambda.io/processi

[attachment: re_apply.png]

David Powell

Dec 19, 2011, 4:16:16 PM
to ruote

Thanks for the prompt reply John. I added the rescue+logging to our
TouchParticipant as you suggested. However, the workitem still just
"stops" and the log is not triggered, so it seems something else is
going wrong. The re-apply trick does get it working again, but is
obviously not ideal.

I'm keen to try and pin down the problem. Unfortunately, this problem
only occurs about every 24hrs, so debugging will be a pain. Do you
have any suggestions on where to start looking?

Cheers,

-- David Powell

John Mettraux

Dec 19, 2011, 11:41:37 PM
to openwfe...@googlegroups.com

On Mon, Dec 19, 2011 at 01:16:16PM -0800, David Powell wrote:
>
> Thanks for the prompt reply John. I added the rescue+logging as you
> suggested to our TouchParticipant. However, the workitem still just
> "stops" and the log is not triggered so something else is going wrong
> it seems. The re-apply trick does get it working again, but is
> obviously not ideal.
>
> I'm keen to try and pin down the problem. Unfortunately, this problem
> only occurs about every 24hrs, so debugging will be a pain. Do you
> have any suggestions on where to start looking?

Hello David,

I should have asked that earlier, but does it stop always at the same point?
Is there a pattern emerging somehow?

With a 24h frequency we'd better turn some more logging on and wait.
One idea would be to 'instrument' the storage a bit, like in:

---8<---
require 'pp'

class PowellStorage < Ruote::Redis::Storage

  def initialize(redis, options={})
    @log = File.open("ruote_redis_#{$$}.log", 'wb')
    super
  end

  def put_msg(action, options)
    @log.puts('p=' * 40)
    PP.pp([ action, options ], @log)
    super
  end

  def get_msgs
    msgs = super
    @log.puts('=g' * 40)
    msgs.each { |msg| PP.pp(msg, @log); @log.puts('- ' * 40) }
    msgs
  end
end
--->8---

If the messages get in (put_msg) but don't get out (get_msgs) that would
incriminate ruote-redis or redis.

There are other techniques for watching/archiving the engine/workers'
activity. The one above is very straightforward, but it produces too much
info. This might be better (and more greppable):

---8<---
class PowellStorage < Ruote::Redis::Storage

  def initialize(redis, options={})
    @log = File.open("ruote_redis_#{$$}.log", 'wb')
    super
  end

  def put_msg(action, options)
    @log.puts([ 'in', action, options['fei'].inspect ].join(' '))
    super
  end

  def get_msgs
    msgs = super
    msgs.each { |msg|
      @log.puts([ 'out', msg['action'], msg['fei'].inspect ].join(' '))
    }
    msgs
  end
end
--->8---

Adding the msg['put_at'] could help too.
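For instance, the 'out' line could carry the stamp like this (a standalone sketch; the sample msg hash is made up, and 'put_at' is the field name mentioned above):

```ruby
require 'stringio'

# Standalone sketch of an 'out' log line that includes the msg's
# 'put_at' stamp. The msg hash below is a made-up example.
log = StringIO.new

msg = {
  'action' => 'reply',
  'fei'    => { 'wfid' => '20111219-example', 'expid' => '0_0' },
  'put_at' => '2011-12-19 13:00:00.123'
}

log.puts([ 'out', msg['action'], msg['fei'].inspect, msg['put_at'] ].join(' '))

puts log.string
```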

Does it stop always at the same point? Is there a pattern emerging somehow?


Cheers,

David Powell

Dec 23, 2011, 12:35:04 AM
to openwfe...@googlegroups.com
On Tue, Dec 20, 2011 at 3:41 PM, John Mettraux <jmet...@gmail.com> wrote:
>
> Does it stop always at the same point? Is there a pattern emerging somehow?

Thanks for the suggestions John.  I put in logging as you suggested in your second example and have run that for the last 2 days.  Unfortunately, the problem has not reoccurred since the logging was in place!

The logging is too verbose to leave in place over the holiday break, so I'll have to take it out and look at it again in the new year.

Thanks again for your help,

-- David Powell

David Powell

Jan 19, 2012, 4:42:21 PM
to openwfe...@googlegroups.com
We are still having some trouble with this which we have not been able to diagnose.

Another issue, which I suspect is related, is that a workitem may become "stuck", as I mentioned above.  We can often "re-apply" the expression and have it resume.  However, sometimes after we "re-apply", the "_ruote" web interface will show the expression as "cancelling" and it will just stay in that state.  Once it shows "cancelling" we can neither re-apply it nor kill it.  Any suggestions about how to further investigate this problem would be appreciated.

Cheers,

-- David Powell


John Mettraux

Jan 19, 2012, 5:31:57 PM
to openwfe...@googlegroups.com

On Fri, Jan 20, 2012 at 08:42:21AM +1100, David Powell wrote:
>
> We are still having some trouble with this which we have not been able to
> diagnose.

Hello David,

ouch too bad.

> Another issue, which I suspect is related is that a workitem may become
> "stuck" as I mentioned above. We can often "re-apply" the expression and
> have it resume. However, sometimes after we "reapply", the "_ruote" web
> interface will show the expression as "cancelling" and it will just stay in
> that state. Once it shows "cancelling" we cannot reapply, nor kill it.
> Any suggestions about how to further investigate this problem would be
> appreciated.

"re-apply" will cancel an expression and its children and, once that's done,
it will apply the expression again (with the same tree and workitem, or with
a new tree and/or workitem).

So at its heart "re-apply" uses the cancel mechanism. An expression being
re-applied shows as "cancelling" and cannot be further cancelled. Killing
should go over that and place the expression (and its children) in the
"dying" state.

Is that related to that participant:

On Thu, Dec 15, 2011 at 03:26:28PM -0800, David Powell wrote:
>
> We have a simple participant that touches a file, waits 20s, then
> "rewinds".

May I see the code of this participant?

What does the expression/tree you're re-applying look like? Is the
re-applied expression a parent of the call to the participant mentioned
above?

David Powell

Jan 19, 2012, 7:45:44 PM
to openwfe...@googlegroups.com
Thanks for your reply John - response inline below.

On Fri, Jan 20, 2012 at 9:31 AM, John Mettraux <jmet...@gmail.com> wrote:
>
> > Another issue, which I suspect is related is that a workitem may become
> > "stuck" as I mentioned above.  We can often "re-apply" the expression and
> > have it resume.  However, sometimes after we "reapply", the "_ruote" web
> > interface will show the expression as "cancelling" and it will just stay in
> > that state.  Once it shows "cancelling" we cannot reapply, nor kill it.
> > Any suggestions about how to further investigate this problem would be
> > appreciated.
>
> "re-apply" will cancel an expression and its children and once it's done, it
> will apply the expression again (with the same tree and workitem or with a
> new tree and/or workitem).
>
> So at its heart "re-apply" uses the cancel mechanism. An expression being
> re-applied shows as "cancelling" and cannot be further cancelled. Killing
> should go over and place the expression (and its children) in the "dying"
> state.
>
> Is that related to that participant:
>
> On Thu, Dec 15, 2011 at 03:26:28PM -0800, David Powell wrote:
> >
> > We have a simple participant that touches a file, waits 20s, then
> > "rewinds".

We have noticed this type of issue with different participants, so it does not seem participant-specific.  It also occurs with this participant, which is the simplest one, so it is perhaps the best to focus on for now.



> May I see the code of this participant?

Sure:

module Participant
  class FileToucher
    include Ruote::LocalParticipant

    def initialize(opts = {})
    end

    def consume(workitem)
      @workitem = workitem
      FileUtils.touch [File.join(Rails.root, 'public', 'ruote_speed.txt')]
      reply_to_engine workitem
    rescue => e
      Rails.logger.error "Error in FileToucher : #{e}"
    end

    def cancel(fei, flavour)
      # do nothing
    end

  end
end

 

> What does the expression/tree you're re-applying look like? Is the
> re-applied expression a parent expression to the call to the participant
> mentioned above?



The reapply expression:
fields: {}
tree:
[ "cursor", {}, [
  [ "participant", { "file_toucher": null }, [] ],
  [ "wait", { "20s": null }, [] ],
  [ "rewind", {}, [] ]
] ]

 An example of /_ruote  output:

fei expid exp applied state
20111207-bonuhopidi 0 0 define 2012/01/19 15:18:54
20111207-bonuhopidi 0_0 0_0 cursor 2012/01/19 15:18:56 cancelling


Thanks again for your help with this.

-- David Powell

John Mettraux

Jan 19, 2012, 8:11:44 PM
to openwfe...@googlegroups.com

On Fri, Jan 20, 2012 at 11:45:44AM +1100, David Powell wrote:
>
> > May I see the code of this participant ?
>
> module Participant
>   class FileToucher
>     include Ruote::LocalParticipant
>
>     def initialize(opts = {})
>     end
>
>     def consume(workitem)
>       @workitem = workitem
>       FileUtils.touch [File.join(Rails.root, 'public', 'ruote_speed.txt')]
>       reply_to_engine workitem
>     rescue => e
>       Rails.logger.error "Error in FileToucher : #{e}"
>     end
>
>     def cancel(fei, flavour)
>       # do nothing
>     end
>   end
> end

Hello David,

The line

  @workitem = workitem

can safely be removed, it seems.

As mentioned in

http://groups.google.com/group/openwferu-users/msg/570fd511f0bc7791

the flow will get stuck if there is an error.

---8<---
def consume(workitem)
  FileUtils.touch [File.join(Rails.root, 'public', 'ruote_speed.txt')]
rescue => e
  Rails.logger.error "Error in FileToucher : #{e}"
ensure
  reply_to_engine workitem
end
--->8---

might be better.

That's consistent with the initial symptoms you described back in December...
Workitem disappearing.


I hope this helps, best regards,

David Powell

Jan 19, 2012, 11:51:09 PM
to openwfe...@googlegroups.com
On Fri, Jan 20, 2012 at 12:11 PM, John Mettraux <jmet...@gmail.com> wrote:
> I hope this helps, best regards,


Thanks for the suggestion John. We have tried that, but it doesn't seem to be the problem.

We have done some digging in the ruote code and pinned down why we cannot "re-apply" some expressions.  The problem is that some expressions have "children", but those children don't exist in the storage.  So when ruote tries to re-apply, it issues a "cancel" which propagates down through the children.  It then attempts to cancel a child that does not exist, so that child never issues a reply_to_parent.

We were able to fix our problematic workitems by writing some code that runs through each child and checks if it really exists.  Something like the following, which I'm sure is rough:

  def remove_nonexistent_children(exp)
    cs = exp.children
    ok = cs.select do |child|
      child_exp = RuoteKit.engine.fetch_flow_expression(child)
      if child_exp
        remove_nonexistent_children(child_exp)
        true
      else
        puts "Removing : #{child}"
        false
      end
    end

    if ok != cs
      exp.h["children"] = ok
      exp.persist_or_raise
    end
  end

  Document.stalled.each do |doc|   
    puts "Re-applying document #{doc.id}"
    fexp = RuoteKit.engine.process(doc.wfid).expressions.last
    remove_nonexistent_children(fexp)
    RuoteKit.engine.re_apply(fexp.fei)
  end




The root cause of the problem is still unclear.  The evidence points to some messages being lost by the redis backend - although I don't know where.  We have noticed that when we attempt to launch many workitems at once, invariably some go missing.  We have done tests using sequel as a backend and this seems to solve that issue.  We have tried setting 'do_not_thread' on all our participants, which reduced the problem somewhat but didn't solve it completely.

At this stage we are going to migrate to the sequel backend instead.  Thanks for your help.

Cheers,
--
David Powell

John Mettraux

Jan 19, 2012, 11:56:30 PM
to openwfe...@googlegroups.com

On Fri, Jan 20, 2012 at 03:51:09PM +1100, David Powell wrote:
>
> The root cause of the problem is still unclear. The evidence points to the
> fact that some messages are lost by the redis backend - although I don't
> know where. We have noticed when we attempt to launch many workitems at
> once, then invariably some go missing. We have done tests using sequel as
> a backend and this seems to solve that issue. We have tried setting
> 'do_not_thread' on our all participants which did reduce the problem
> somewhat but didn't solve it completely.

Hello again David,

then I need to do some stress testing on ruote-redis. You told me you were
using ruote-redis cdb5712aa and the redis gem 2.2.2; what version of redis
itself are you using, and on which platform?

Thanks in advance !

David Powell

Jan 20, 2012, 12:04:03 AM
to openwfe...@googlegroups.com
We are running redis-server 2:2.2.2-1ubuntu1 on Ubuntu 10.04.

In production we have redis on a separate server, but that does not seem to affect this problem.

Cheers,
--
David Powell


 

John Mettraux

Jan 31, 2012, 11:59:23 PM
to openwfe...@googlegroups.com

On Fri, Jan 20, 2012 at 04:04:03PM +1100, David Powell wrote:
>
> We are running redis-server 2:2.2.2-1ubuntu1 on ubuntu 10.04.
>
> In production we have redis on a separate server, but that does not seem to
> affect this problem.

Hello David,

I'm looking at the redis issue, trying with Ruby 1.8.7p249 for now.

I realized I never asked you which version of Ruby you were using when faced
with the ruote + redis issue. It'd be great if you could tell me.


Thanks in advance,

David Powell

Feb 1, 2012, 12:38:20 AM
to openwfe...@googlegroups.com
On Wed, Feb 1, 2012 at 3:59 PM, John Mettraux <jmet...@gmail.com> wrote:
>
> On Fri, Jan 20, 2012 at 04:04:03PM +1100, David Powell wrote:
> >
> > We are running redis-server 2:2.2.2-1ubuntu1 on ubuntu 10.04.
> >
> > In production we have redis on a separate server, but that does not seem to
> > affect this problem.
>
> Hello David,
>
> I'm looking at the redis issue, trying with 1.8.7p249 for now.
>
> I realized I never asked you which version of Ruby you were using when faced
> with the ruote + redis issue. It'd be great if you could tell me.



Hi John,

We are using:

ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]

Cheers,

-- David

John Mettraux

Feb 1, 2012, 1:16:19 AM
to openwfe...@googlegroups.com

On Wed, Feb 01, 2012 at 04:38:20PM +1100, David Powell wrote:
>
> We are using :
>
> ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]

Hello David,

I think I could reproduce the issue with this script:

https://gist.github.com/1715445

and Ruby 1.8.7p249 and 1.9.2p290.

With sleep 0.025, it doesn't lose launch messages, but with sleep 0.001 or no
sleep at all, it does.

I need to find a workaround.


Thanks again,

David Powell

Feb 1, 2012, 4:19:09 PM
to openwfe...@googlegroups.com
> Hello David,
>
> I think I could reproduce the issue with this script:
>
> https://gist.github.com/1715445
>
> and Ruby 1.8.7p249 and 1.9.2p290.
>
> With sleep 0.025, it doesn't lose launch messages, but with sleep 0.001 or no
> sleep at all, it does.
>
> I need to find a workaround.


Hi John,

Glad to hear you can reproduce the problem.  For us it seemed that the problems weren't just on launch, but occurred whenever several messages went to the redis store in quick succession.  Looking forward to hearing what the cause of this issue is!

Cheers,

-- David


John Mettraux

Feb 3, 2012, 1:15:39 AM
to openwfe...@googlegroups.com
Hello David,

I found the cause of the problem.

The fix replaces:

---8<---
def get_msgs

  msgs = []

  while (doc = @redis.lpop('msgs')) and (msgs.size < 21)
    msgs << from_json(doc)
  end

  msgs
end
--->8---

with:

---8<---
def get_msgs

  doc = @redis.lpop('msgs')

  doc ? [ from_json(doc) ] : []
end
--->8---

Usually (at low loads), the 21 limit is never reached. At high loads, the
limit is reached easily, dropping 1 message in 21.

https://github.com/jmettraux/ruote-redis/commit/8b5e847874fb13753541e197de806ba312e739c8
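The off-by-one can be demonstrated with a plain Array standing in for the Redis 'msgs' list (a simulation, not ruote-redis code): once 21 messages have been collected, one more lpop still runs before the size check fails, and that popped message is silently discarded.

```ruby
# Simulate the buggy get_msgs loop with an Array as the queue.
queue = (1..25).to_a   # 25 pending messages
msgs = []

# The lpop (shift here) executes before the size check, so when the
# size check finally fails, the message just popped is thrown away.
while (doc = queue.shift) and (msgs.size < 21)
  msgs << doc
end

puts msgs.size    # 21 delivered
puts queue.size   # 3 left; message 22 was popped and lost
```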


I'm sorry for all the time you wasted because of my stupidity.

David Powell

Feb 3, 2012, 3:11:34 AM
to openwfe...@googlegroups.com
On Fri, Feb 3, 2012 at 5:15 PM, John Mettraux <jmet...@gmail.com> wrote:
> I found the cause of the problem.

Ahh, nice work John.

> I'm sorry for all the time you wasted because of my stupidity.


No problem, happens to the best of us.  Thank you for your interest in solving the problem, it is appreciated.

Cheers,

-- David Powell 