We recently upgraded to the (unreleased) ruote 2.2.1 (because we
needed the upgrades in ruote-redis and redis-rb) and ever since the
worker can't find some of the participants for the workflow
intermittently... this happens on average for about 1 out of 50 or so
expressions...
The full stack trace from ruote kit is:
/srv/www/jasperdocs-production/shared/bundle/ruby/1.9.1/bundler/
gems/ruote-b66b34f1315c/lib/ruote/exp/fe_participant.rb:141:in `apply'
/srv/www/jasperdocs-production/shared/bundle/ruby/1.9.1/bundler/
gems/ruote-b66b34f1315c/lib/ruote/exp/flow_expression.rb:319:in
`do_apply'
/srv/www/jasperdocs-production/shared/bundle/ruby/1.9.1/bundler/
gems/ruote-b66b34f1315c/lib/ruote/worker.rb:381:in `launch'
/srv/www/jasperdocs-production/shared/bundle/ruby/1.9.1/bundler/
gems/ruote-b66b34f1315c/lib/ruote/worker.rb:284:in `process'
/srv/www/jasperdocs-production/shared/bundle/ruby/1.9.1/bundler/
gems/ruote-b66b34f1315c/lib/ruote/worker.rb:198:in `step'
/srv/www/jasperdocs-production/shared/bundle/ruby/1.9.1/bundler/
gems/ruote-b66b34f1315c/lib/ruote/worker.rb:93:in `run'
/srv/www/jasperdocs-production/shared/bundle/ruby/1.9.1/bundler/
gems/ruote-b66b34f1315c/lib/ruote/worker.rb:104:in `block in
run_in_thread'
The weird thing is, that replaying the expression fixes it and the
workitem happily continues its way through the workflow.
Any ideas?
For completeness, here's our stack:
* ruote - 2.2.1 (from github) - ref
b66b34f1315c4bc99acec8d5db07e62160a4b896
* ruote-redis - 2.2.1 (from github) - ref
cdb5712aa05f8717962898f9a9a3cd283d5de971
* ruote-kit - 2.2.1 (from github) - ref
191817aa92715b40b377ce0c28293382a6c47efd
* redis-rb - 2.2.2
Thanks for your help.
Cheers,
Tom
Hello Tom,
what is the error message ?
What version of Ruby are you using (1.9 something, but more precisely) ?
Best regards,
--
John Mettraux - http://lambda.io/processi
Hello again,
what version of Redis ? Any special setting ? What operating system ? Is the application on the same host as Redis ?
Thanks in advance,
Thanks for the quick reply.
Here are a few more details:
- Error Message:
ArgumentError: no participant named 'some_participant_name' found
- Ruby: 1.9.2p180
- Redis: 2.2.2 (00000000:0)
- Operating System: Ubuntu 10.04
- We have two workers that do all the hard work... they're on a
different server than redis
The tricky thing here is, that we updated to the latest version of
ruote AND moved the Redis DB to a different host at the same time - so
we don't know what's the issue here... but we could move redis back if
you reckon that could be the issue!
Other than that I reckon we're pretty standard.
Cheers,
Tom
Hello Tom,
thanks for the extra info. Here are a few extra questions, if they lead nowhere I'll probably cook a small test script, run it on my side and then pass it to you for running in your environment.
Do those "1 in 50" occur when the workers are on ? Would it occur when one worker starts (with the error occurring in the already running worker) ?
Do you guys register participant on the fly (ie not at worker startup, could be from code or from your process definition) ?
Cheers,
Hey John.
About the rate... this is kinda outdated now - we updated ruote with
the fix you did to the participant list and the rate seemed to have
dropped significantly... it might now be 1 out of 200 or so... just
FYI, we got a really simple workitem in that gets rewinded every 20
seconds and touches a file (which works as a performance check for
ruote for us), and it seems to get 'stuck' every 3-4 hours or so (with
the error mentioned above).
I'm not sure if I can answer your second question, but given that we
restart the workers on every deploy, I assume the # of workers (and
the order they're started in) doesn't really matter...
And no, all participants are registered before we start the workers...
Thanks heaps,
Tom
OK,
I ran this small script on my Mac (snoleo) with redis-2.2.2 (compiled manually) and couldn't get it to score nils (redis and client on the same host). How does it score in your environment ?
(https://gist.github.com/1392979)
---8<---
$:.unshift('../ruote/lib')
$:.unshift('../ruote-redis/lib')
require 'rufus-json/automatic'
require 'redis'
require 'ruote'
require 'ruote-redis'
storage = Ruote::Redis::Storage.new('db' => 15, 'thread_safe' => true)
storage.put( # pp output
{"type"=>"configurations",
"_id"=>"participant_list",
"list"=>
[["^.+$",
["Ruote::BlockParticipant",
{"on_workitem"=>
"proc do |workitem|\n context.tracer << workitem.participant_name + \"\\n\"\n end"}]]]})
count = 0
n = 100_000
t = Time.now
n.times do |i|
list = storage.get_configuration('participant_list')
count = count + 1 if list == nil
end
puts "#{count} in #{n} is nil"
puts "took #{Time.now - t} seconds"
--->8---
It seems your deployment has nil coming instead of the participant list from time to time (around here https://github.com/jmettraux/ruote/blob/master/lib/ruote/svc/participant_list.rb#L316-324).
We could harden that (or ruote-redis), but I find it puzzling the issue doesn't manifest itself when getting expressions (it would lead to all sorts of quirks).
I wonder what the script will say in your environment.
Hi John.
I hope you had a good weekend!
I ran the script multiple times and it never returned nil either...
which is extremely strange! I reckon we'll try and isolate the issue
on our end. We'll roll back to the previous ruote and redis versions
in order to find out if that's the issue, or the remote redis server.
Thanks,
Tom
I'm an idiot! I totally forgot that we have the system deployed on 2
servers (with the same configuration), BUT, one uses a local redis,
the other uses a remote redis, and it happens on both servers! So the
problem is somewhere in the gem setup...
I'll try and isolate it by updating only one gem at a time (as much as
possible)! Hopefully we can at least find out whether the problem is
in ruote or in redis-rb...
I'll keep you updated!
Cheers,
Tom
So we've been digging around a bit, trying different setups and so
on...
We reverted one of our servers to the old setup and kept one with the
new versions...
Interestingly, it didn't happen at all yesterday, BUT (maybe instead)
we got a lot of exceptions from redis...
undefined method `inject' for #<String:0x0000000733b5d8>
/srv/www/jasperdocs-production/shared/bundle/ruby/1.9.1/bundler/gems/
ruote-redis-c7efcd8a0802/lib/ruote/redis/storage.rb:240:in `get_many'
ruote (2.2.0) lib/ruote/part/storage_participant.rb:317:in
`fetch_all'
ruote (2.2.0) lib/ruote/part/storage_participant.rb:260:in `query'
We have no idea what's going on here - our assumption is, that we have
one (or many) stale redis connections. The interesting bit is, that
restarting passenger fixes the issue. So we think passenger starts
getting stale. The thing we couldn't figure out yet, is if the problem
is with redis-rb or ruote-redis...
I know that's a very vague description and everything, I was just
wondering if you've seen anything like that before...
Thanks,
Tom
Hello Tom,
so the String issue occurs here:
What does logging say redis is returning as a String ?
I've had such an issue previously, here's how I solved it:
https://github.com/jmettraux/ruote-redis/commit/cdb5712aa05f8717962898f9a9a3cd283d5de971
Anything that is returned that isn't an array is replaced with an empty array. It works well [with the the tests]. IIRC it only occurred in the tests tear down (I should double check, maybe it's a connection stale/closed issue in fact).
Seems it happens with mget too.
Maybe, this line
https://github.com/jmettraux/ruote-redis/blob/master/lib/ruote/redis/storage.rb#L287
could be replaced with something like
---8<---
docs = ids.length > 0 && @redis.mget(*ids)
docs = docs.is_a?(Array) ? docs : []
--->8---
> We have no idea what's going on here - our assumption is, that we have
> one (or many) stale redis connections. The interesting bit is, that
> restarting passenger fixes the issue. So we think passenger starts
> getting stale. The thing we couldn't figure out yet, is if the problem
> is with redis-rb or ruote-redis...
>
> I know that's a very vague description and everything, I was just
> wondering if you've seen anything like that before...
Sorry, no experience with stale connection. I have to admit I don't use redis in production.
Maybe we could add some connection check/re-establishment code if necessary.
I will dig further this get/mget issue here and come back to you. Meanwhile if you have more feedback, it'd be welcome.
I wonder about the overlap between connection stale and keys/mget returning a string instead of an array.
>
> so the String issue occurs here:
>
> https://github.com/tricycle/ruote-redis/blob/c7efcd8a080233aa7ba68423...
>
> What does logging say redis is returning as a String ?
Yes, that's exactly the place... surprisingly the logging seems to
catch valid results, just as a string, eg...
Queries Id's: ["expressions/0_0_0!e823b353d603379895d07bb338fec833!
20111025-majeguyu"]
Result (String):
expressions/0_0_0_0!f864fa66e92cd1e923b56f4f5ecab96a!20111025-majeguyu
Queries Id's: ["01:08:16.410770", "01:08:16.412251", "UTC\",\"_rev\":
\"1\"}", "UTC\"},\"wf_name\":\"pay_invoice\",\"wf_revision\":null,\"fei
\":{\"engine_id\":\"engine\",\"wfid\":\"20111127-bojujiboni\",\"subid
\":\"0534cec4eac18e57b27b66e935e3a241\",\"expid\":\"0_0_0_0\"},
\"participant_name\":\"ocr_validator\",\"re_dispatch_count\":0,\"type
\":\"workitems\",\"_id\":\"wi!0_0_0_0!0534cec4eac18e57b27b66e935e3a241!
20111127-bojujiboni\",\"wfid\":\"20111127-bojujiboni\",\"put_at\":
\"2011-11-29", "{\"fields\":{\"document_id\":28103,\"organization_id\":
119,\"requires_organization_picker\":false,\"__tags__\":[\"main\"],
\"data_entered\":true,\"supplier_id\":11,\"valid_gl_accounts\":true,
\"requires_supplier_creator\":false,\"requires_gl_account_creator
\":false,\"requires_ocr_validation\":true,\"params\":{\"ref\":
\"ocr_validator\",\"timeout\":\"7d\"},\"dispatched_at\":\"2011-11-29"]
Result (String):
{"fields":{"document_id":28205,"organization_id":
43,"requires_organization_picker":false,"__tags__":
["main"],"data_entered":true,"supplier_id":
111,"valid_gl_accounts":true,"requires_supplier_creator":false,"requires_gl_account_creator":false,"requires_ocr_validation":true,"params":
{"ref":"ocr_validator","timeout":"7d"},"dispatched_at":"2011-11-28
01:03:38.258097 UTC"},"wf_name":"pay_invoice","wf_revision":null,"fei":
{"engine_id":"engine","wfid":"20111128-
befujatsuno","subid":"054e6dd4ff89359f044b9e76827492a9","expid":"0_0_0_0"},"participant_name":"ocr_validator","re_dispatch_count":
0,"type":"workitems","_id":"wi!0_0_0_0!
054e6dd4ff89359f044b9e76827492a9!20111128-
befujatsuno","wfid":"20111128-befujatsuno","put_at":"2011-11-28
01:03:38.258797 UTC","_rev":"1"}
>
> I've had such an issue previously, here's how I solved it:
>
> https://github.com/jmettraux/ruote-redis/commit/cdb5712aa05f871796289...
>
We changed our copy of the repo to make sure this doesn't return a
string anymore (thanks for the guidance)...
https://github.com/tricycle/ruote-redis/commit/44f869ec2d277a6d99d5546c43ec1a306e0a361c
The problem now is, that we don't know in case stuff goes wrong...
There must be an underlying issue somewhere (and the more I read the
ruote-redis codebase, I reckon it must be in redis-rb).
>
> Sorry, no experience with stale connection. I have to admit I don't use redis in production.
>
> Maybe we could add some connection check/re-establishment code if necessary.
>
> I will dig further this get/mget issue here and come back to you. Meanwhile if you have more feedback, it'd be welcome.
Did you get anywhere with your research?
>
> I wonder about the overlap between connection stale and keys/mget returning a string instead of an array.
It would be an interesting correlation... but it's incredibly hard to
test, because it's an intermittent problem and I haven't come across a
scenario where it happens every time...
Cheers,
Tom
Hello Tom,
I removed the the I had added to keys_to_a and tried again:
---8<---
def @redis.keys_to_a(pattern)
#if (a = (keys(pattern) rescue nil)).is_a?(Array)
# a
#else
# []
#end
keys(pattern) rescue []
end
--->8---
But I did encounter the issue. (MacOSX snoleo, redis-rd 2.2.2, redis 2.2.2, ruote[-redis] edge, ruby 1.8.7p249 and ruby 1.9.2p290, tried the whole functional test suite thrice per ruby version).
I'm tempted to remove the check, I have the impression it was with previous versions of redis-rb/redis.
Tell me if you need me to test with a redis on a distinct host and I'll take the time to try (hopefully reproducing the issue).
Hi John.
Hope you had a good weekend... I don't have any hard evidence, but the
check was definitely valuable for us. I tried to add some logging to
it - which failed - but it told me that redis was returning Strings at
times, even in #keys_to_a
>
> Tell me if you need me to test with a redis on a distinct host and I'll take the time to try (hopefully reproducing the issue).
Not sure if that's going to help... and I'm still not any smarter on
which system is the actual the problem (redis, redis-rb or ruote-
redis)... We assume it's redis, but then again, that's just a guess.
Thanks,
Tom
Hello Tom,
I had a good weekend, I hope you had a good one too.
I had a second look at #keys_to_a and noticed I added it recently. I will
keep this defensive code around.
> > Tell me if you need me to test with a redis on a distinct host and I'll
> > take the time to try (hopefully reproducing the issue).
>
> Not sure if that's going to help... and I'm still not any smarter on
> which system is the actual the problem (redis, redis-rb or ruote-
> redis)... We assume it's redis, but then again, that's just a guess.
I've pushed the "defensive mget":
https://github.com/jmettraux/ruote-redis/commit/0e01a2e585e554e734f35808e58b20a273248c15
I hope it helps.
For the initial issue you reported, if it reappears, I wonder how we could
harden ruote-redis to deal with that. The symptom is an empty participant
list, the redis "get" returned nil.
Maybe you could monkey-patch ruote-redis with something like to force a
second "get":
---8<---
class Ruote::Redis::Storage
def get_configuration(key)
2.times do
if r = get('configurations', key)
return r
else
r
end
end
end
end
--->8---
Please keep me informed, cheers,
Yeah, I had a good weekend as well - Thanks.
>
> I've pushed the "defensive mget":
>
> https://github.com/jmettraux/ruote-redis/commit/0e01a2e585e554e734f35...
>
> I hope it helps.
Thanks for that. We gonna stick with our solution (with the logging)
for now - just in case we need the additional info:
https://github.com/tricycle/ruote-redis/tree/logging
>
> For the initial issue you reported, if it reappears, I wonder how we could
> harden ruote-redis to deal with that. The symptom is an empty participant
> list, the redis "get" returned nil.
Yeah, about the initial issue... since we rolled back to ruote (and
ruote-redis 2.2.0) this issue never came up again - so there is
definitely a bug somewhere in the new code... unfortunately we can't
really upgrade (and test) again, because this caused us serious issues
with our live app. But as you can see from our ruote-redis branch
(https://github.com/tricycle/ruote-redis/tree/logging), we cherry
picked almost everything, except basically this commit:
https://github.com/jmettraux/ruote-redis/commit/a73a1b9bb9847805dafc2da0eaa30cf6ebe8035c
And I don't think that commit messes with the participant list !?! So
it could be an issue with ruote itself... Unfortunately, but I wasn't
able to find a commit that could've introduced an issue - mainly
because there are heaps (good work on the updates btw) !
Cheers,
Tom