New release and future plans


Mathias Meyer

Apr 21, 2010, 10:01:54 AM
to nanite
Hey,

just wanted to spread the word that there's been a minor release for
Nanite today (0.4.1.17), which is mostly of importance if you're using
the Redis backend for state storage. It fixes deprecation warnings
with newer versions of the Redis library, and an inconsistency when
dealing with intermediate results.

More importantly though, I'd like to lay out what else happened, and
what I'm planning on working on in the coming weeks.

Most of what I have on my list is dealing with reliability, since
right now there's simply a risk of losing messages when agents or
mappers suddenly go down. Messages in Nanite are ack'd before they are
delivered, and therefore lost when an unrecovered error occurs or the
process dies.

I started working on improving that on the agent part, though it's not
yet fully done. In a branch [1] I introduced a feature that will allow
shutting down agents gracefully. When they still have work to do, and
you send them SIGINT or SIGTERM, they'll just disconnect from the
broker, clean up their pidfile, but won't shut down until all their
work is done. Due to the asynchronous nature of EventMachine I'm
relying on the user to tell me when he's done working through a
method. Right now I'm not assuming that the actor method returning
means the task is done. That is mainly due to how we use it, because
we have actor methods kicking off longer running tasks that poll with
periodic timers. So that's up for discussion. I'll build something
similar for mappers, but less sophisticated, basically just waiting
for a grace period to fully shut down the process, not killing message
processing right in between.
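
The graceful shutdown described above could be sketched roughly as follows. This is a minimal plain-Ruby illustration, not the code from the branch: `ShutdownTracker` and all of its methods are hypothetical names, and it assumes the user reports completion explicitly instead of relying on the actor method returning.

```ruby
# Hypothetical helper, not part of Nanite: tracks in-flight work so that a
# shutdown request only completes once every started task was reported done.
class ShutdownTracker
  attr_reader :in_flight

  def initialize(&on_exit)
    @in_flight = 0
    @shutting_down = false
    @exited = false
    @on_exit = on_exit
  end

  # Called when an actor method starts handling a message.
  def task_started
    @in_flight += 1
  end

  # Called explicitly by user code when the work is really finished,
  # which may be long after the actor method itself returned.
  def task_done
    @in_flight -= 1 if @in_flight > 0
    finish_if_idle
  end

  # Would be called from a SIGINT/SIGTERM handler in a real agent.
  def request_shutdown
    @shutting_down = true
    finish_if_idle
  end

  def exited?
    @exited
  end

  private

  # Run the exit hook once, and only when shutdown was requested and no work is left.
  def finish_if_idle
    return unless @shutting_down && @in_flight.zero? && !@exited
    @exited = true
    @on_exit.call if @on_exit
  end
end

log = []
tracker = ShutdownTracker.new { log << :exit }
tracker.task_started
tracker.request_shutdown                      # work still pending, so no exit yet
log << :still_running unless tracker.exited?
tracker.task_done                             # last task finished, exit hook fires
```

In a real agent the exit hook would stop the EventMachine reactor. The explicit `task_done` call is what allows long-running tasks driven by periodic timers to keep the process alive.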

We're using the branch code in production, and it's stable, so I'll
probably merge it soon, and make it a release 0.4.2.

Other than that, I'd like to improve the reliability of message
delivery itself. I might rely more on Redis to achieve that, but I'm
still playing with ideas. Basically I want to remove the ack before
dispatch in actors, and forge it together with the reliability stuff
outlined above, relying on the user or a returned method to know that
a method was done, redelivering to another agent after a certain time
has elapsed. Both could be options, and timeouts for redelivery could
also be something to include. I'm peeking at Beetle [2] for
inspiration here, because that's something Nanite is missing
currently. When a message is gone, it's gone. That's sometimes
acceptable, but I'd like to improve on that, because at least for us,
it's not.
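
To make the idea of acking after processing and redelivering after a timeout concrete, here is a toy in-memory model. It is only a sketch under assumptions: `RedeliveringQueue` and its methods are hypothetical, it does not talk to RabbitMQ or Redis, and time is passed in explicitly to keep the example deterministic.

```ruby
# Toy in-memory model, not Nanite or AMQP code. All names are hypothetical.
class RedeliveringQueue
  def initialize(redelivery_timeout)
    @timeout = redelivery_timeout
    @ready = []     # messages waiting for delivery
    @unacked = {}   # message => time it was handed out
  end

  def publish(message)
    @ready << message
  end

  # Hand the next message to a consumer, but keep track of it until it is acked.
  def deliver(now)
    requeue_expired(now)
    message = @ready.shift
    @unacked[message] = now if message
    message
  end

  # The consumer confirms that processing has finished.
  def ack(message)
    @unacked.delete(message)
  end

  def pending?
    !@ready.empty? || !@unacked.empty?
  end

  private

  # Put messages back on the ready list when no ack arrived within the timeout.
  def requeue_expired(now)
    expired = @unacked.select { |_, delivered_at| now - delivered_at >= @timeout }.keys
    expired.each do |message|
      @unacked.delete(message)
      @ready << message
    end
  end
end

queue = RedeliveringQueue.new(30)
queue.publish("job-1")

first  = queue.deliver(0)    # agent A takes the job and then dies without acking
second = queue.deliver(10)   # too early for redelivery, nothing to hand out
third  = queue.deliver(30)   # timeout elapsed, the job goes to another agent
queue.ack(third)             # agent B finished the work
```

Because the message is only forgotten on `ack`, a crashed agent no longer means a lost message. The trade-off is that a slow agent may see its work handed to a second agent, so the actors would need to tolerate duplicates.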

Let me know if you have comments or other things you'd like to see
improved or fixed in Nanite. I'm more than happy to fully outline the
ideas once I have, well, a good idea of them.

Cheers, Mathias

[1] http://github.com/ezmobius/nanite/tree/exit_hooks
[2] http://xing.github.com/beetle/
--
http://paperplanes.de
http://twitter.com/roidrage

--
You received this message because you are subscribed to the Google Groups "Nanite" group.
To post to this group, send email to nan...@googlegroups.com.
To unsubscribe from this group, send email to nanite+un...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nanite?hl=en.

Kyle Burton

May 17, 2010, 3:54:56 PM
to Nanite
> Most of what I have on my list is dealing with reliability, since
> right now there's simply a risk of losing messages when agents or
> mappers suddenly go down. Messages in Nanite are ack'd before they are
> delivered, and therefore lost when an unrecovered error occurs or the
> process dies.

I noticed that in the code - it's great that it's being worked on.
Thank you.

I've also noticed some behavior I'd like to get some feedback on
surrounding the offline queue behavior. Explained in more detail
below...

> Other than that, I'd like to improve the reliability of message
> delivery itself. I might rely more on Redis to achieve that, but I'm
> still playing with ideas. Basically I want to remove the ack before
> dispatch in actors, and forge it together with the reliability stuff
> outlined above, relying on the user or a returned method to know that
> a method was done, redelivering to another agent after a certain time
> has elapsed. Both could be options, and timeouts for redelivery could
> also be something to include. I'm peeking at Beetle [2] for
> inspiration here, because that's something Nanite is missing
> currently. When a message is gone, it's gone. That's sometimes
> acceptable, but I'd like to improve on that, because at least for us,
> it's not.

This would effectively require the actor itself to signal the agent
that it has processed the message and that it should be acked -
otherwise the message would be relinquished back to rabbit for
redelivery, correct?


> [2]http://xing.github.com/beetle/

Thank you for the beetle link, it looks like Beetle and Nanite may
overlap quite a bit in functionality - do you see a major difference
in their goals or feature sets? Beetle requires a Redis server and
assumes clustered brokers - any other major differences I'm not
seeing?

I've tracked the behavior I was talking about down to the conjunction
of a few configuration options along with a simple agent and mapper.
The pertinent configuration options are:

:prefetch => 1,
:offline_failsafe => true,
:offline_redelivery_frequency => 30,

The prefetch of 1 was set in both the agent and the mapper. I started
a mapper and made about a dozen requests, which all ended up in the
offline queue - I could see this by using
'rabbitmqctl list_queues -p /nanite'. Also, at this point the mapper's
offline queue handler pulled the first message off of that queue, and,
having no agents to send it to yet, did nothing. Next I started my
agent, which sent a
registration message and a ping message. list_queues showed both the
registration message as well as the ping messages backing up in
RabbitMQ - the pings in the heartbeat queue. The mapper never
received the register or heartbeat messages.

What appears to be happening is that the mapper is repeatedly getting
messages from the offline queue and never receiving messages from the
other queues - it appears to have the correct bindings and connections/
channels. This is manifesting itself as the mapper just hanging and
not doing anything while the agent awaits jobs from the mapper - and
the two never communicate with each other.

If I up the prefetch (for the mapper) to a number larger than the
number of messages that happen to be in the offline queue, then
everything unblocks and processing continues.
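
One possible reading of this behavior, offered only as a hypothesis, is that the prefetch limit is shared by all queues consumed on the same channel and that the offline message is held without being acked. The toy model below illustrates that assumption. It is not amqp or Nanite code, and `Channel` and `delivered_queues` are made-up names.

```ruby
# Toy model only, assuming one prefetch window shared by every queue on a channel.
class Channel
  def initialize(prefetch)
    @prefetch = prefetch
    @unacked = []
    @queues = {}
  end

  def publish(queue, message)
    (@queues[queue] ||= []) << message
  end

  # Deliver messages queue by queue until the shared window is full.
  def deliver
    delivered = []
    @queues.each do |name, messages|
      while @unacked.size < @prefetch && !messages.empty?
        message = messages.shift
        @unacked << message
        delivered << [name, message]
      end
    end
    delivered
  end
end

# Returns the names of the queues that managed to deliver at least one message.
def delivered_queues(prefetch)
  channel = Channel.new(prefetch)
  3.times { |i| channel.publish(:offline, "request-#{i}") }
  channel.publish(:registration, "register")
  channel.publish(:heartbeat, "ping")
  # Nothing is ever acked here, mimicking offline messages held with no agent available.
  channel.deliver.map { |name, _| name }.uniq
end

blocked   = delivered_queues(1)   # only the offline queue gets through
unblocked = delivered_queues(5)   # window larger than the offline backlog
```

Under this assumption a prefetch of 1 is used up by the first unacked offline message, which would match the registration and heartbeat messages backing up in the broker.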

I've reproduced this on both Linux and OSX using the following gems:

amqp-0.6.7
eventmachine-0.12.10
nanite-0.4.1.17

Do you know if this is a limitation of the amqp library? Of Rabbit?
Is there a better way of handling this kind of reliability of
messaging? Any pointers or help will be appreciated.

Thank you for all your efforts. Best Regards,

Kyle