Failover should suspend faulty/unresponsive target node

gonfi

Jun 26, 2014, 10:41:35 AM

to membrane...@googlegroups.com

It seems that failover configured by loadbalancing (balancer, clusters, cluster, 2 nodes) does not suspend a node that does not reply as it should.

In my case the failure sometimes shows up as an HTTP connection failure (service offline) and sometimes as an HTTP 500 (service offline, but an HTTP proxy in between returns that). I have router/transport/httpClient failOverOn5XX="true" enabled.

My response times for 1000 requests are around 3000ms with 1 node, less with 2 nodes, and around 60000ms (20x more) with 2 nodes when one is offline. A lot of time is wasted trying to use the faulty node. No requests fail in the end; it's just a huge waste of time.

I had assumed that this was already implemented, with a background task that periodically checks the faulty node for availability and then puts it back in the list of available nodes. Is it not? Where and in what form would I implement this? As an interceptor? Any pointers?

Gonfi

Thomas Bayer

Jun 27, 2014, 12:22:19 AM

to membrane...@googlegroups.com

Hi Gonfi,
you can monitor your nodes using Nagios or another monitoring tool. If a node goes down, the monitoring tool can tell Membrane over a simple REST API to take that node out of load balancing. Later it can tell Membrane to use that node again. Have a look at the loadbalancer-client-2 example in the examples folder of the distribution for details.
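
As a rough sketch of what such a monitoring script could send, the snippet below only builds the "down" and "up" notification URLs. The /clustermanager/down and /clustermanager/up paths with host and port query parameters follow the pattern I know from Membrane's clusterNotification interceptor; the address localhost:9010 and the class and method names are assumptions made for illustration, so check them against the loadbalancer-client-2 example before relying on them.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds the URLs a monitoring tool would call to take a node out of, or back into, load balancing. */
class ClusterNotificationUrls {

    // Assumption: address under which the notification endpoint is exposed.
    private final String baseUrl;

    ClusterNotificationUrls(String baseUrl) {
        // Drop a trailing slash so the path can be appended uniformly.
        this.baseUrl = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
    }

    /** URL to call when the monitor sees the node failing. */
    String down(String host, int port) {
        return build("down", host, port);
    }

    /** URL to call when the monitor sees the node healthy again. */
    String up(String host, int port) {
        return build("up", host, port);
    }

    private String build(String action, String host, int port) {
        // Assumed path and parameter names; verify against the example in the distribution.
        return baseUrl + "/clustermanager/" + action
                + "?host=" + URLEncoder.encode(host, StandardCharsets.UTF_8)
                + "&port=" + port;
    }
}
```

The monitoring tool would issue a plain GET on the down(...) URL when its probe fails and on the up(...) URL once the probe succeeds again.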

--
Thomas

--
You received this message because you are subscribed to the Google Groups "membrane-monitor" group.
To unsubscribe from this group and stop receiving emails from it, send an email to membrane-monit...@googlegroups.com.
To post to this group, send email to membrane...@googlegroups.com.
Visit this group at http://groups.google.com/group/membrane-monitor.
For more options, visit https://groups.google.com/d/optout.

gonfi

Jun 27, 2014, 4:57:14 AM

to membrane...@googlegroups.com

ah, that's nice, that will work. thanks very much for the quick answer.

gonfi

Aug 17, 2014, 2:42:10 PM

to membrane...@googlegroups.com

Success story, that worked well. The API for informing membrane from the outside is nice.

gonfi

Sep 13, 2015, 9:19:13 PM

to membrane-monitor

I'm getting back to this as the external monitoring only works to some extent in practice.
My goal is to create an alternative to the RoundRobinStrategy.

The idea is that it keeps track of the most recent outcomes of Nodes.
That information is present in the HttpClient. Example outcomes are:
- status code 5xx
- various forms of exceptions: ConnectionException, SocketException, UnknownHostException, etc.
- other: success 2xx codes, redirect 3xx codes and 4xx user errors.
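
The three outcome classes above could be captured roughly like this. The enum and its method names are hypothetical and not part of Membrane; the sketch only assumes that each attempt ends either with a status code or with an exception.

```java
/** Coarse classification of a single attempt against a node (hypothetical helper). */
enum AttemptOutcome {
    SERVER_ERROR, // HTTP 5xx
    EXCEPTION,    // connection-level failure, no response received
    OTHER;        // 2xx, 3xx and 4xx: the node itself answered properly

    /**
     * @param statusCode HTTP status of the response; ignored when error is not null
     * @param error      exception thrown by the attempt, or null if a response arrived
     */
    static AttemptOutcome of(int statusCode, Exception error) {
        if (error != null) {
            return EXCEPTION;
        }
        if (statusCode >= 500 && statusCode <= 599) {
            return SERVER_ERROR;
        }
        return OTHER;
    }

    /** True if the outcome should count against the node's health. */
    boolean countsAgainstNode() {
        return this != OTHER;
    }
}
```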

This information is currently lost. I'd need to connect my DispatchingStrategy with the
HttpClient to listen for events. Any hint on how to do that? I'm imagining an event
listener. There's the HTTPClientInterceptor, but I don't see how that could help.

Note: In my case, the 5xx codes are independent of the calling user. I can treat any
5xx as "server has issues". For other users this may vary.

The details of the impl are not very clear yet. For example in case of an UnknownHostException
there could be a background thread trying to resolve it, and until then the Node would not be selected.
For other errors it could compute the most recent success rate, and give each Node a chance
of being selected in relation to that.
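
To make the success-rate idea concrete, here is a minimal sketch that is independent of Membrane. Nodes are plain strings, and the window size of 20 and the minimum weight of 0.05 are arbitrary assumptions; the minimum weight keeps a failing node in the draw occasionally so that its recovery can be noticed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Picks a node with a probability proportional to its recent success rate. */
class SuccessRateSelector {
    private static final int WINDOW = 20;          // assumption: outcomes remembered per node
    private static final double MIN_WEIGHT = 0.05; // assumption: floor so failing nodes still get probed

    private final Map<String, Deque<Boolean>> history = new HashMap<>();
    private final Random random;

    SuccessRateSelector(Random random) {
        this.random = random;
    }

    /** Records one outcome and forgets the oldest one once the window is full. */
    synchronized void record(String node, boolean success) {
        Deque<Boolean> window = history.computeIfAbsent(node, k -> new ArrayDeque<>());
        if (window.size() == WINDOW) {
            window.removeFirst();
        }
        window.addLast(success);
    }

    /** Share of successful outcomes in the window; a node without data counts as healthy. */
    synchronized double successRate(String node) {
        Deque<Boolean> window = history.get(node);
        if (window == null || window.isEmpty()) {
            return 1.0;
        }
        long ok = window.stream().filter(b -> b).count();
        return (double) ok / window.size();
    }

    /** Weighted random choice over the given nodes. */
    synchronized String select(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no nodes to select from");
        }
        double[] weights = new double[nodes.size()];
        double total = 0;
        for (int i = 0; i < nodes.size(); i++) {
            weights[i] = Math.max(MIN_WEIGHT, successRate(nodes.get(i)));
            total += weights[i];
        }
        double pick = random.nextDouble() * total;
        for (int i = 0; i < nodes.size(); i++) {
            pick -= weights[i];
            if (pick < 0) {
                return nodes.get(i);
            }
        }
        return nodes.get(nodes.size() - 1); // guard against floating-point rounding
    }
}
```

An UnknownHostException would probably deserve separate handling (not selecting the node at all until it resolves), which this sketch leaves out.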

I imagine it would be useful for others too. Any pointers for getting started much appreciated ;-)
I'd like to implement the logic, but not spend a lot of time trying to get it integrated.

Best,
Gonfi

Thomas Bayer

Sep 14, 2015, 4:59:45 AM

to membrane...@googlegroups.com

Hi gonfi,
I think you should be able to extend the balancer by implementing your own DispatchingStrategy. The interface looks like this:

public interface DispatchingStrategy {

    public Node dispatch(LoadBalancingInterceptor interceptor, Exchange exc) throws EmptyNodeListException;

    public void done(AbstractExchange exc);

}

See the ByThreadStrategy as an example that uses the done method.

https://github.com/membrane/service-proxy/blob/master/core/src/main/java/com/predic8/membrane/core/interceptor/balancer/ByThreadStrategy.java

Inside the done method you get all the information about the exchange (request and response), which allows very flexible dispatching. Just add an instance variable to your DispatchingStrategy that keeps the historic data needed to make a decision. You should be able to do this just by adding your JAR to Membrane and configuring it using Spring, without changing a single line of Membrane's code.
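
A stripped-down sketch of such a strategy is shown below. It does not implement the real interface: nodes are plain strings, done() receives the node and status code directly instead of an exchange, and the current time is passed in so the logic can be tested. The class name and the suspension period are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Round robin that skips nodes which recently answered with a 5xx. */
class SuspendingRoundRobin {
    private final long suspendMillis;
    // The "instance variable that keeps the historic data": node -> time until which it is skipped.
    private final Map<String, Long> suspendedUntil = new ConcurrentHashMap<>();
    private final AtomicInteger counter = new AtomicInteger();

    SuspendingRoundRobin(long suspendMillis) {
        this.suspendMillis = suspendMillis;
    }

    /** Counterpart of dispatch(): next node in round-robin order that is not suspended. */
    String dispatch(List<String> nodes, long nowMillis) {
        if (nodes.isEmpty()) {
            throw new IllegalStateException("no nodes configured");
        }
        int start = Math.floorMod(counter.getAndIncrement(), nodes.size());
        for (int i = 0; i < nodes.size(); i++) {
            String candidate = nodes.get((start + i) % nodes.size());
            if (suspendedUntil.getOrDefault(candidate, 0L) <= nowMillis) {
                return candidate;
            }
        }
        return nodes.get(start); // every node is suspended: fall back to plain round robin
    }

    /** Counterpart of done(): remember how the node that served the exchange behaved. */
    void done(String node, int statusCode, long nowMillis) {
        if (statusCode >= 500) {
            suspendedUntil.put(node, nowMillis + suspendMillis);
        } else {
            suspendedUntil.remove(node);
        }
    }
}
```

In a real DispatchingStrategy the node and status code would have to be read from the exchange passed to done().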

Give us a pull request if you want to share your contribution later with the community.

Cheers,
Thomas

Gonfi den Tschal

Sep 14, 2015, 8:07:05 AM

to membrane...@googlegroups.com

Thanks a lot for your quick and informative reply. That sounds very smooth. I'll come back when I have something to share.

Gonfi den Tschal

Sep 14, 2015, 9:33:52 AM

to membrane...@googlegroups.com

It looks like the done() method of the DispatchingStrategy interface is never called.

I've set breakpoints and am running the LoadBalancingInterceptorTest. The breakpoint in dispatch() is hit, the one in done() is not.

Also, I've searched the whole code base and found no call to done().

The dispatch() is called from the LoadBalancingInterceptor, and as you can see there is no done():
https://github.com/membrane/service-proxy/blob/9a40f4dbd61475d0eeac3010d12741fa51806826/core/src/main/java/com/predic8/membrane/core/interceptor/balancer/LoadBalancingInterceptor.java

Besides that, I still don't see how I'll have the whole history of attempts in the Exchange. Won't I just have the last one, the one that probably succeeds? If there is one Node constantly failing, this will be unseen by my done() code. That's what I think so far, maybe I'm wrong, debugging will show.

PS: the LoadBalancingInterceptorTest is very nice to have.

Greetings

Gonfi

Thomas Bayer

Sep 15, 2015, 12:08:24 PM

to membrane...@googlegroups.com

Hi Gonfi,

On 14.09.15 at 15:33, Gonfi den Tschal wrote:

> It looks like the done() method of the DispatchingStrategy interface is never called.
>
> I've set a break point, and am running the LoadBalancingInterceptorTest. dispatch() breakpoint yes, done() no.
>
> Also, I've searched the whole code base, nothing.
>
> The dispatch() is called from the LoadBalancingInterceptor, and as you can see there is no done():
> https://github.com/membrane/service-proxy/blob/9a40f4dbd61475d0eeac3010d12741fa51806826/core/src/main/java/com/predic8/membrane/core/interceptor/balancer/LoadBalancingInterceptor.java

You are right. The done() call is missing. I think we can add it to the class LoadBalancingInterceptor.java like so:

    @Override
    public Outcome handleResponse(Exchange exc) throws Exception {
        log.debug("handleResponse");

        if (sessionIdExtractor != null) {
            String sessionId = getSessionId(exc.getResponse());

            if (sessionId != null) {
                balancer.addSession2Cluster(sessionId, Cluster.DEFAULT_NAME, (Node) exc.getProperty("dispatchedNode"));
            }
        }

        updateDispatchedNode(exc);
        strategy.done(exc);

        return Outcome.CONTINUE;
    }

> Besides that, I still don't see how I'll have the whole history of attempts in the Exchange. Won't I just have the last one, the one that probably succeeds? If there is one Node constantly failing, this will be unseen by my done() code. That's what I think so far, maybe I'm wrong, debugging will show.

Have a look at the HTTP client code. There is the code for the retries. Please let me know if you need more help.
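
To illustrate why the retry loop is the place where failed attempts become visible, here is a small sketch. It is not Membrane's HttpClient code: the Call interface, the Attempt class and the method names are hypothetical, and the loop simply tries each node once until one returns a non-5xx status.

```java
import java.util.ArrayList;
import java.util.List;

/** Failover loop that keeps a record of every attempt, not only the final one. */
class RetryWithHistory {

    /** One entry per attempt, including the failed ones. */
    static final class Attempt {
        final String node;
        final int statusCode;  // -1 when the attempt ended with an exception
        final Exception error; // null when a response was received

        Attempt(String node, int statusCode, Exception error) {
            this.node = node;
            this.statusCode = statusCode;
            this.error = error;
        }

        boolean failed() {
            return error != null || statusCode >= 500;
        }
    }

    /** Hypothetical stand-in for "send the request to this node and return the status code". */
    interface Call {
        int send(String node) throws Exception;
    }

    /** Tries the nodes in order and returns the full attempt history. */
    static List<Attempt> callWithFailover(List<String> nodes, Call call) {
        List<Attempt> attempts = new ArrayList<>();
        for (String node : nodes) {
            Attempt attempt;
            try {
                attempt = new Attempt(node, call.send(node), null);
            } catch (Exception e) {
                attempt = new Attempt(node, -1, e);
            }
            attempts.add(attempt);
            if (!attempt.failed()) {
                break; // stop at the first usable response
            }
        }
        return attempts;
    }
}
```

A strategy that only sees the final exchange would miss the earlier failed entries of such a list, which is the concern raised above.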

Cheers,
Thomas
