nameko and HA rabbit


tsachi...@gmail.com

Nov 20, 2016, 6:45:21 AM
to nameko-dev
Hi,
Our RabbitMQ cluster consists of 3 servers clustered with an HA-all policy.
Due to disk errors, our master node went down, and all our clients started throwing errors.
We expected them to connect to the newly elected master (the promoted slave), but that didn't happen.

The connection string in the conf.yaml file is as follows:
AMQP_URI='amqp://path_to_node1;amqp://path_to_node2;amqp://path_to_node3'

So, given that the first (master) node is down, we expected nameko (and its clients) to switch to the second node (the elected master).
But that didn't happen.
Furthermore, restarting the services also didn't work, since the master node is first on the list of URIs, and is down.

We expect that if the first connection is refused/broken, the next URI should be used.

So, how would you handle automatic reconnect to elected nodes once the master is down?

tsachi...@gmail.com

Nov 20, 2016, 8:48:18 AM
to nameko-dev
I just found an existing issue about this: https://github.com/celery/kombu/issues/185 (from 2012!!)
The suggested fix also has a bug that leaks memory.
Anyway, I don't understand why this isn't a top priority.

Matt Yule-Bennett

Nov 20, 2016, 9:06:00 AM
to nameko-dev, tsachi...@gmail.com
I wasn't aware that kombu supported this kind of connection parameter. Nameko has never explicitly supported it. From quickly scanning the kombu docs it looks like we should be round-robining between the provided URIs, but it's never been tested.

At Student.com (and everywhere else I know that uses HA'd RabbitMQ) there's a load-balancer in front of the cluster, and that takes care of routing traffic to healthy nodes.

If you provide a test case we might be able to figure out why nameko doesn't support round-robin connections out of the box.

tsachi...@gmail.com

Nov 20, 2016, 9:43:04 AM
to nameko-dev
Well, a load balancer is a valid option, but because of the way RabbitMQ implements HA (queue operations are always routed through the queue's master node), going through a load balancer creates unnecessary hops between nodes, thus adding latency.

Performance-wise, the best solution (though arguably a bad design) is for the clients themselves to cycle through a list of nodes when the main node is down.

It IS implemented in kombu, but as mentioned, it has a bug, so it always tries to connect to the master, even if it is down.

Matt Bennett

Nov 20, 2016, 9:52:40 AM
to nameko-dev, tsachi...@gmail.com
Is the bug in kombu or nameko? And can you reproduce it in a test case?

--
You received this message because you are subscribed to the Google Groups "nameko-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nameko-dev+...@googlegroups.com.
To post to this group, send an email to namek...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/nameko-dev/c4697c43-127f-4493-867d-0d1c2d3c39ad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

tsachi...@gmail.com

Nov 20, 2016, 9:59:36 AM
to nameko-dev
The bug is in kombu and can be easily reproduced.
1. Set the AMQP_URI variable in conf.yaml to a semicolon-separated string of URIs: the master first, followed by its slaves.
2. Start a nameko service.
3. Stop the master node.
4. Try to communicate with nameko.

Instead of kombu switching to the newly elected master, it will continue trying to connect to the original master, which is down.
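
For reference, a minimal service for step 2 could look like this (the service and method names are illustrative; conf.yaml carries the semicolon-separated AMQP_URI from step 1):

# service.py -- a minimal nameko service for the reproduction
from nameko.rpc import rpc


class PingService:
    name = 'ping_service'

    @rpc
    def ping(self):
        return 'pong'

Start it with `nameko run service --config conf.yaml`; for step 4, call ping_service.ping() (e.g. from a nameko shell or a standalone proxy) after stopping the master.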

Matt Yule-Bennett

Nov 20, 2016, 1:51:41 PM
to nameko-dev, tsachi...@gmail.com
This doesn't show whether the bug is with kombu or nameko, because you're using both. Can you reproduce it using kombu alone?

tsachi...@gmail.com

Nov 21, 2016, 10:10:57 AM
to nameko-dev
Well, after digging into the kombu and nameko code, the bug is in kombu's connection handling.
See my second post in this thread for the issue opened about this bug.

You can reproduce it easily with kombu alone.
Create a connection, channel, consumer and queue, and start consuming in a loop. Then stop the master rabbit node and watch the errors that the consumer throws.
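
A minimal sketch of that reproduction (the hostnames and queue name are illustrative):

# kombu_repro.py -- consume in a loop, then stop the master node
import socket

from kombu import Connection, Exchange, Queue

queue = Queue('test-queue', Exchange('test-exchange'), routing_key='test')

# Semicolon-separated URIs: master first, then the slaves.
with Connection('amqp://node1;amqp://node2;amqp://node3') as conn:

    def on_message(body, message):
        print(body)
        message.ack()

    with conn.Consumer(queue, callbacks=[on_message]):
        while True:
            # Stop the master while this runs: instead of failing over
            # to node2/node3, kombu keeps retrying node1 and throws.
            try:
                conn.drain_events(timeout=1)
            except socket.timeout:
                pass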

tsachi...@gmail.com

Nov 22, 2016, 4:07:35 AM
to nameko-dev, tsachi...@gmail.com
By the way, we are now testing a configuration of rabbit nodes behind an ELB, and this is not good either.
Once you stop the master node (and wait for the ELB to remove it from service), nameko starts throwing "IOError: Socket closed" exceptions, exclusive queues created by ClusterRpcProxy become locked, and things look really bad.
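
For context, the client side of our test is just the standalone proxy (the URI and service name are illustrative):

from nameko.standalone.rpc import ClusterRpcProxy

config = {'AMQP_URI': 'amqp://rabbit-elb.example.com'}

with ClusterRpcProxy(config) as cluster_rpc:
    # After the master is stopped and the ELB drops it, this call
    # starts failing and the exclusive reply queue stays locked.
    print(cluster_rpc.ping_service.ping())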

Did you try to see what happens when you stop the master?

Matt Yule-Bennett

Dec 5, 2016, 1:06:02 PM
to nameko-dev, tsachi...@gmail.com
I've done some testing with ELB.

The "IOError: socket closed" exceptions are expected. Kombu prints the stacktrace when it detects the disconnection, and then immediately tries to reconnect again. It will keep trying until the connection can be re-established, which should be as soon as the ELB redirects traffic to the other node.

With nameko 2.4.4 you will see "disconnected while waiting for reply" from the client, which is also expected. This will be raised for any requests that were in flight when the connection was lost, because there's no way to know whether the reply was swallowed by a reply queue being auto-deleted.

Things behave better after the changes in https://github.com/nameko/nameko/pull/383. Critically, increasing the safety_interval in consume() stops the ResourceLocked exception being thrown by the RPC proxy (although it's worth noting that the client should recover even in this case).

The changes in https://github.com/nameko/nameko/pull/337 are also required for nameko to be truly tolerant of disconnections. Without them, publishers will lose messages immediately after a disconnection, which often leads to hanging workers (e.g. when an RPC reply message is lost, the caller waits forever).

I expect #337 to land soon, but in the meantime are you able to do some testing with 2.4.4?

tsachi...@gmail.com

Dec 6, 2016, 3:42:08 AM
to nameko-dev, tsachi...@gmail.com
Hi Matt,

No, we're still running nameko 2.2.0.

We solved the disconnections issue by monkey-patching the kombu package (essentially fixing a bug in it).

We're not using an ELB, since in our tests it didn't work well. We prefer (not ideal, but it works better for us) to run in HA mode and let the clients connect to all active rabbit nodes.
This architecture initially didn't work due to the bug in kombu, which, as mentioned above, we monkey-patched.

I will take a look at nameko 2.4 soon. At the moment, things seem to work just fine.

Tsachi

Matt Yule-Bennett

Dec 6, 2016, 5:14:18 AM
to nameko-dev, tsachi...@gmail.com
Glad to hear you got it working. So you went back to passing multiple URIs and letting kombu round-robin by itself?

What was the monkey-patch / bug-fix you had to apply to kombu?

Matt.

tsachi...@gmail.com

Dec 6, 2016, 5:49:10 AM
to nameko-dev, tsachi...@gmail.com
That is correct. All clients are instantiated with multiple URIs, and kombu round-robins the hosts in case of master failure.

Our patch looks like this:

from kombu.connection import Connection

# Keep a reference to the original method so the patch can delegate to it.
original_info = Connection._info


def _info(self, resolve=True):
    # Fixes a bug in the kombu.Connection._info method: the alternate
    # hosts are returned as a separate 'alternates' param, so reconnect
    # attempts only ever target the first (master) URI.
    info = original_info(self, resolve=resolve)

    info = list(info)

    # The last item is the 'alternates' param. Remove it and fold the
    # full host list back into 'hostname' so every URI is retained.
    _, alt = info.pop()
    if alt:
        info[0] = ('hostname', ';'.join(alt))

    return tuple(info)

# Apply the patch.
Connection._info = _info

Import this piece of code before any kombu connections are created, and it will fix the round-robin bug.
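
For example, if the patch lives in a module called kombu_patch.py (the name is illustrative), an entry point just needs to import it first:

# main.py -- illustrative; apply the patch before nameko/kombu
# create any connections
import kombu_patch  # noqa: F401 -- the patch module shown above

from nameko.standalone.rpc import ClusterRpcProxy

config = {'AMQP_URI': 'amqp://node1;amqp://node2;amqp://node3'}

with ClusterRpcProxy(config) as cluster_rpc:
    # Connections created from here on use the patched Connection._info,
    # so on master failure kombu cycles to the next URI in the list.
    print(cluster_rpc.ping_service.ping())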

Tsachi