We are using ruote-kit with ruote-redis and seem to be hitting a bug.
We have a simple participant that touches a file, waits 20s, then
"rewinds". This works fine for a while, but then usually once a day
it stops. The "/_ruote" endpoint shows the "process" still exists,
but there is no entry for it in "workitems" nor "schedule", and
nothing is reported on the "error" page.
Has anyone else experience a similar problem? Any suggestions on how
I should go about tracking down the issue?
Details of the our gem versions:
gem 'ruote', '~> 2.2.0'
gem 'ruote-kit', :git => 'http://github.com/kennethkalmer/ruote-
kit.git', :ref => '191817aa92715b'
gem 'redis', '2.2.2'
gem 'ruote-redis', :git => 'https://github.com/jmettraux/ruote-
redis.git', :ref => 'cdb5712aa'
Cheers,
Hello David,
welcome to ruote's mailing list.
It'd be interesting to know exactly where it stops. It sounds like a "msg"
got lost.
Sometimes it happens with corner cases in participant implementation, like in
---8<---
class TouchParticipant
include Ruote::LocalParticipant
def consume(workitem)
FileUtils.touch(workitem.fields['filename'])
reply_to_engine(workitem)
rescue
puts "something went wrong"
end
end
--->8---
If anything goes wrong in that participant, a message will be emitted to the
STDOUT and reply_to_engine will not be seen.
Maybe it stops around your participant implementation. Placing some logging
output may help.
It could also happen for some mysterious reason (defect in my implementation
of ruote-redis for example), helping determine where it really stops would
help.
A way of fixing that is to re-apply the expression that was supposed to
reply. In ruote-kit this can be done from the dashboard (see attached image).
It's only a repair trick, it requires admin intervention.
I'd be glad to help determining the cause.
Best regards,
--
John Mettraux - http://lambda.io/processi
Thanks for the prompt reply John. I added the rescue+logging as you
suggested to our TouchParticipant. However, the workitem still just
"stops" and the log is not triggered so something else is going wrong
it seems. The re-apply trick does get it working again, but is
obviously not ideal.
I'm keen to try and pin down the problem. Unfortunately, this problem
only occurs about every 24hrs, so debugging will be a pain. Do you
have any suggestions on where to start looking?
Cheers,
-- David Powell
Hello David,
I should have asked that earlier, but does it stop always at the same point?
Is there a pattern emerging somehow?
With a 24h frequency we'd better turn some more logging on and wait.
One idea would be to 'instrument' the storage a bit, like in:
---8<---
require 'pp'
class PowellStorage < Ruote::Redis::Storage
def initialize(redis, options={})
@log = File.open("ruote_redis_#{$$}.log", 'wb')
super
end
def put_msg(action, options)
@log.puts('p=' * 40)
PP.pp([ action, options ], @log)
super
end
def get_msgs
msgs = super
@log.puts('=g' * 40)
msgs.each { |msg| PP.pp(msg, @log); @log.puts('- ' * 40) }
msgs
end
end
--->8---
If the messages get in (put_msg) but don't get out (get_msgs) that would
incriminate ruote-redis or redis.
There are other techniques for watching/archiving the engine/workers
activity. The one above is very straightforward, it produces too much info.
This might be better (and more greppable):
---8<---
class PowellStorage < Ruote::Redis::Storage
def initialize(redis, options={})
@log = File.open("ruote_redis_#{$$}.log", 'wb')
super
end
def put_msg(action, options)
@log.puts([ 'in', action, options['fei'].inspect ].join(' '))
super
end
def get_msgs
msgs = super
msgs.each { |msg|
@log.puts([ 'out', msg['action'], msg['fei'].inspect ].join(' '))
}
msgs
end
end
--->8---
Adding the msg['put_at'] could help too.
Does it stop always at the same point? Is there a pattern emerging somehow?
Cheers,
Does it stop always at the same point? Is there a pattern emerging somehow?
Hello David,
ouch too bad.
> Another issue, which I suspect is related is that a workitem may become
> "stuck" as I mentioned above. We can often "re-apply" the expression and
> have it resume. However, sometimes after we "reapply", the "_ruote" web
> interface will show the expression as "cancelling" and it will just stay in
> that state. Once it shows "cancelling" we cannot reapply, nor kill it.
> Any suggestions about how to further investigate this problem would be
> appreciated.
"re-apply" will cancel an expression and its children and once it's done, it
will apply the expression again (with the same tree and workitem or with a
new tree and/or workitem).
So at its heart "re-apply" uses the cancel mechanism. An expression begin
re-applied shows as "cancelling" and cannot be further cancelled. Killing
should go over and place the expression (and its children) in the "dying"
state.
Is that related to that participant:
On Thu, Dec 15, 2011 at 03:26:28PM -0800, David Powell wrote:
>
> We have a simple participant that touches a file, waits 20s, then
> "rewinds".
May I see the code of this participant ?
What does the expression/tree you're re-applying look like ? Is the
re-applied expression a parent expression to the call to the participant
mentioned above ?
> Another issue, which I suspect is related is that a workitem may become"re-apply" will cancel an expression and its children and once it's done, it
> "stuck" as I mentioned above. We can often "re-apply" the expression and
> have it resume. However, sometimes after we "reapply", the "_ruote" web
> interface will show the expression as "cancelling" and it will just stay in
> that state. Once it shows "cancelling" we cannot reapply, nor kill it.
> Any suggestions about how to further investigate this problem would be
> appreciated.
will apply the expression again (with the same tree and workitem or with a
new tree and/or workitem).
So at its heart "re-apply" uses the cancel mechanism. An expression begin
re-applied shows as "cancelling" and cannot be further cancelled. Killing
should go over and place the expression (and its children) in the "dying"
state.
Is that related to that participant:
On Thu, Dec 15, 2011 at 03:26:28PM -0800, David Powell wrote:
>
> We have a simple participant that touches a file, waits 20s, then
> "rewinds".
May I see the code of this participant ?
What does the expression/tree you're re-applying look like ? Is the
re-applied expression a parent expression to the call to the participant
mentioned above ?
| fei | expid | exp | applied | state |
| 20111207-bonuhopidi 0 | 0 | define | 2012/01/19 15:18:54 | |
| 20111207-bonuhopidi 0_0 | 0_0 | cursor | 2012/01/19 15:18:56 |
cancelling |
Hello David,
@workitem = workitem
can be safely removed it seems.
As mentioned in
http://groups.google.com/group/openwferu-users/msg/570fd511f0bc7791
the flow will get stuck if there is an error.
---8<---
def consume(workitem)
FileUtils.touch [File.join(Rails.root, 'public', 'ruote_speed.txt')]
rescue => e
Rails.logger.error "Error in FileToucher : #{e}"
ensure
reply_to_engine workitem
end
--->8---
might be better.
That's consistent with the initial symptoms you described back in December...
Workitem disappearing.
I hope this helps, best regards,
I hope this helps, best regards,
Hello again David,
then I need to do some stress testing on ruote-redis. You told me you were
using ruote-redis cdb5712aa and redis 2.2.2, what version of redis itself are
you using and on which platform ?
Thanks in advance !
Hello David,
I'm looking at the redis issue, trying with 1.8.7p249 for now.
I realized I never asked you which version of Ruby you were using when faced
with the ruote + redis issue. It'd be great if you could tell me.
Thanks in advance,
Hello David,
On Fri, Jan 20, 2012 at 04:04:03PM +1100, David Powell wrote:
>
> We are running redis-server 2:2.2.2-1ubuntu1 on ubuntu 10.04.
>
> In production we have redis on a separate server, but that does not seem to
> affect this problem.
I'm looking at the redis issue, trying with 1.8.7p249 for now.
I realized I never asked you which version of Ruby you were using when faced
with the ruote + redis issue. It'd be great if you could tell me.
Hello David,
I think I could reproduce the issue with this script:
https://gist.github.com/1715445
and Ruby 1.8.7p249 and 1.9.2p290.
With sleep 0.025, it doesn't lose launch messages, but with sleep 0.001 or no
sleep at all, it does.
I need to find a workaround.
Thanks again,
Hello David,
I think I could reproduce the issue with this script:
https://gist.github.com/1715445
and Ruby 1.8.7p249 and 1.9.2p290.
With sleep 0.025, it doesn't lose launch messages, but with sleep 0.001 or no
sleep at all, it does.
I need to find a workaround.
I found the cause of the problem.
The fix replaces:
---8<---
def get_msgs
msgs = []
while (doc = @redis.lpop('msgs')) and (msgs.size < 21)
msgs << from_json(doc)
end
msgs
end
--->8---
with:
---8<---
def get_msgs
doc = @redis.lpop('msgs')
doc ? [ from_json(doc) ] : []
end
--->8---
Usually (at low loads), the 21 limit is never reached. At high loads, the
limit is reached easily, dropping 1 message in 21.
https://github.com/jmettraux/ruote-redis/commit/8b5e847874fb13753541e197de806ba312e739c8
I'm sorry for all the time you wasted because of my stupidity.
I found the cause of the problem.
I'm sorry for all the time you wasted because of my stupidity.