Redundancy design for mbus.

yssk22

Apr 23, 2012, 10:21:40 PM
to vcap-dev
Hi

I think mbus (NATS) is a SPoF in the current vcap architecture. It should have both scalability and high availability, and I know the original NATS repository is working on a cluster feature branch, so we should get both of these in the future.

But for now, I'd like to discuss and share high-availability workarounds and know-how with the core developer team.

I have observed the following:

- All components except the CloudController exit if NATS raises an error (such as a connection error).
- If a DEA exits because of a NATS error, the applications on that DEA keep running and can still respond to requests.
- A while after a DEA exits, the routers unregister the applications on the dead DEA because its heartbeats stop. (*A)

I worked out the following workaround for a NATS server failure:

1. If the NATS server goes down, move its IP address to another server and restart NATS there.
   # I use Heartbeat & Pacemaker to do this.
2. DEAs can reconnect to the new NATS server because the IP address has not changed.
   # I tuned the NATS_MAX_RECONNECT_ATTEMPTS and NATS_RECONNECT_TIME_WAIT environment variables so that clients reconnect.
3. DEAs and other components exit with an 'Authorization is required' error when they send a command after reconnecting to the NATS server.
4. Restart the DEAs; they detect the application processes already running on them.

If steps 1-4 take less time than the heartbeat window (*A), we do not need any manual operations to recover our vcap cluster.
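
For step 2, a minimal sketch of what the client-side reconnect tuning could look like, assuming the Ruby NATS client's reconnect options (:reconnect, :max_reconnect_attempts, :reconnect_time_wait); the VIP hostname, defaults, subject, and the idea of a component reading these env vars itself are illustrative assumptions:

```ruby
require 'nats/client'

# Read the reconnect tuning from the environment, falling back to
# illustrative defaults. The env var names come from the workaround above.
max_attempts = (ENV['NATS_MAX_RECONNECT_ATTEMPTS'] || 10).to_i
wait_seconds = (ENV['NATS_RECONNECT_TIME_WAIT'] || 2).to_f

NATS.start(:uri => 'nats://nats-vip.example.internal:4222',  # floating IP / VIP
           :reconnect => true,
           :max_reconnect_attempts => max_attempts,
           :reconnect_time_wait => wait_seconds) do
  NATS.on_error { |e| puts "NATS error: #{e}" }
  # Because the VIP moves with the active NATS server, the URI stays the
  # same and the client simply retries until the standby takes over.
  NATS.subscribe('dea.heartbeat.example') { |msg| puts "got: #{msg}" }
end
```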

What do you think of this? Do you have any plans or workarounds for mbus redundancy?

Mike Heath

Apr 30, 2012, 3:42:19 PM
to vcap...@cloudfoundry.org
I too am concerned about NATS being a SPoF. I'm also concerned by the fact that Derek Collison hasn't touched NATS in 2 months. Is this something he's going to be able to get fixed now that he's gone off and started Apcera? Are there others working on NATS?

-Mike

Jesse Zhang

Apr 30, 2012, 3:50:19 PM
to vcap...@cloudfoundry.org
NATS will not easily fail. Rather than removing NATS as a single point of failure, it is arguably better to evolve the system to be resilient to temporary NATS outages.

Jesse

Mike Heath

Apr 30, 2012, 4:22:37 PM
to vcap...@cloudfoundry.org
Certainly one of the more brilliant aspects of the vcap architecture is the fact that the system will still function if NATS goes down. As far as availability goes, that's great.

However, as is stated on the NATS wiki itself (just about the only thing in the NATS wiki :) ), there are other reasons for supporting more than a single-server model, such as performance and scalability. Granted, I haven't had issues with either of these using NATS...

Maybe I'm just paranoid but moving to a clustered approach would make me feel better about operating our private vcap offering.

-Mike

Phil Whelan

Apr 30, 2012, 6:37:21 PM
to vcap...@cloudfoundry.org
On Tue, May 1, 2012 at 4:50 AM, Jesse Zhang <je...@rbcon.com> wrote:
> NATS will not easily fail. Rather than removing NATS as a single point of
> failure, it is arguably better to evolve the system to be resilient to
> temporary NATS outages.

I think evolving the system to be NATS outage resilient is a good path to take, but I also think removing the SPoF from NATS is important. While the system may not collapse without NATS, its entire communication network is lost, and it would take a dreaded 3am wake-up call to manually get the system fully functioning again.

As noted in the design doc (https://github.com/derekcollison/nats/wiki/Cluster-Design), HA isn't the only thing that NATS cluster support aims to address:
"Currently NATS is a single server design, which limits scalability, performance, and HA."

Phil

yssk22

May 1, 2012, 4:08:17 AM
to vcap-dev
> I think evolving the system to be NATS outage resilient is a good path to
> take, but I also think removing the SPoF from NATS is important.

+1.

I think NATS clustering is the final solution, but in the meantime we'd like to share current reference deployments and know-how for automatic NATS failover. In my environment, Heartbeat with IPaddr2 seems to work for NATS failover, although clients log auth errors during failover.
# We can tune how long a 'temporary outage' is tolerated via the NATS_MAX_RECONNECT_ATTEMPTS and NATS_RECONNECT_TIME_WAIT environment variables on the client side.

I can agree that "NATS will not easily fail", but the hardware and network underneath NATS are not so reliable.


Mike Heath

May 1, 2012, 12:25:24 PM
to vcap...@cloudfoundry.org


On Tuesday, May 1, 2012 2:08:17 AM UTC-6, yssk22 wrote:
> I think NATS clustering is the final solution, but in the meantime we'd like to
> share current reference deployments and know-how for automatic NATS failover.
> In my environment, Heartbeat with IPaddr2 seems to work for NATS failover,
> although clients log auth errors during failover.

I'm pretty sure there's a bug in the NATS client when reconnecting to a NATS server that requires authentication. The NATS client tries to resubscribe and send any pending publish commands before authenticating with the NATS server. I haven't confirmed this 100% in the Ruby client, but I'm relatively certain the Node.js client has this problem. I had the same problem in the Java NATS client that I wrote (https://github.com/mheath/jnats).

-Mike

Ken Robertson

May 1, 2012, 2:09:12 PM
to vcap...@cloudfoundry.org
One thing I've done to address the NATS SPoF is to configure NATS in a sort of active-passive mode. I run NATS on two separate boxes with keepalived and a floating IP (this is definitely not as EC2-friendly). If NATS on one box goes down, or the whole box dies, the other takes over and clients reconnect.

Overall, I haven't experienced any issues with it. Some messages may be lost, but the components are built around that and operations will be retried.

Ken

Jesse Zhang

May 1, 2012, 2:17:35 PM
to vcap...@cloudfoundry.org
Can you kindly reproduce the NATS client bug in a minimal way?

Jesse


Mike Heath

May 1, 2012, 5:36:48 PM
to vcap...@cloudfoundry.org
If it weren't for queue groups, writing an HA version of the NATS server would be very straightforward. Is there anything in vcap that actually uses queue groups?


-Mike


Mike Heath

May 1, 2012, 5:37:31 PM
to vcap...@cloudfoundry.org
I'll try to put together a simple test case to validate my theory.
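
A rough sketch of what such a test case might look like (assumptions: a local NATS server configured to require username/password authorization, and that restarting it while this script runs exercises the client's reconnect path; the subject name is made up):

```ruby
require 'nats/client'

# Assumes a NATS server on localhost:4222 that requires user/pass auth.
# Start the server, run this script, then restart the server while the
# script is still running to force a client reconnect.
NATS.on_error { |e| puts "client error: #{e}" }

NATS.start(:uri => 'nats://user:password@127.0.0.1:4222') do
  NATS.subscribe('repro.subject') { |msg| puts "received: #{msg}" }

  # Publish periodically; if the client resubscribes or flushes pending
  # publishes before re-authenticating on reconnect, the server should
  # reject them with an 'Authorization is required' style error.
  EM.add_periodic_timer(1) { NATS.publish('repro.subject', 'ping') }
end
```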

Matt Page

May 1, 2012, 6:22:19 PM
to vcap...@cloudfoundry.org
> If it weren't for queue groups, writing an HA version of the NATS server would be very straightforward.

If you don't care about the round-robin nature of queue groups (only that the distribution of messages is fair), a probabilistically correct implementation doesn't seem too challenging.

> Is there anything in vcap that actually uses queue groups?
Both the HealthManager (when sending requests to CloudControllers) and
the CloudController (when sending requests to Stagers) use queue groups.
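
For reference, a queue-group subscription with the Ruby NATS client looks roughly like this (a sketch; the subject and queue names are invented, not the actual vcap subjects):

```ruby
require 'nats/client'

NATS.start do
  # Several processes subscribe to the same subject with the same :queue
  # name; NATS delivers each message to only one member of that group.
  NATS.subscribe('staging.requests', :queue => 'stagers') do |msg, reply|
    puts "picked up staging request: #{msg}"
    NATS.publish(reply, 'done') if reply
  end
end
```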

This is something that we'd like to change in the future by eliminating all command-and-control traffic that flows over NATS. Ideally, we would only use NATS for discovery and rely on point-to-point RPC calls for all other requests.

Cheers,
Matt

Dr Nic Williams

May 1, 2012, 7:18:17 PM
to vcap...@cloudfoundry.org

> This is something that we'd like to change in the future by eliminating all
> command-and-control traffic that flows over NATS. Ideally, we would only use
> NATS for discovery and rely on point-to-point RPC calls for all other requests.

Why is RPC more ideal?

Nic 

Mike Heath

May 1, 2012, 7:45:33 PM
to vcap...@cloudfoundry.org
IMO (speaking only as a user of CF), RPC is more ideal here because NATS' primary function is as a broadcast mechanism, whereas RPC is point-to-point. By using NATS for discovery, you cater to NATS' strength. Command-and-control is a point-to-point type of operation, so using RPC makes a lot of sense.

-Mike

Mike Heath

May 1, 2012, 7:50:10 PM
to vcap...@cloudfoundry.org
If there were no queue groups, each NATS server in a cluster could maintain its own publishing table and simply broadcast every message to each node in the cluster. Queue groups require each server in the cluster to maintain a list of queue groups across the entire cluster, to ensure that a message gets delivered to only one member of the queue group. This greatly complicates things.

If vcap eliminates its use of queue groups, which I think it should, creating a clustered NATS server should be simple.

At the very least, it's given me something to play with during the evenings this week... :)

-Mike

Ken Robertson

May 2, 2012, 1:45:56 AM
to vcap...@cloudfoundry.org
+1 on Nic's Q as well.  I'm curious what the motivation is.

Ken

Ken Robertson

May 2, 2012, 1:56:07 AM
to vcap...@cloudfoundry.org
To your point, I think using NATS for C&C was an early design decision that then shaped a lot of other choices.  Overall, C&C messages in CF are fire-and-forget.  Whether or not the command completes isn't a concern of the requestor: if it fails, another message will be sent back and handled, and if it is lost, the health manager will notice and trigger the cloud controller to send it again.  Everything is decoupled and works together to maintain symbiosis.

With an RPC structure, I'd imagine more of an expectation of issuing a command, waiting for it to complete, and returning a result.  Then you end up more coupled, with more state logic.  What if the cloud controller requests something and dies before it is complete?  And if RPC is used in a fire-and-forget fashion, what is the gain over using NATS?  If you tell something to do work over RPC as fire-and-forget and it fails, where does that message go?  And so what if messages are broadcast? That doesn't really hurt anything; in fact I'd consider it useful.  You can easily write a component that listens in on the entire system and gains insight without touching any of the pieces.

That is my view of it...

Ken

Dave Syer

May 2, 2012, 8:34:21 AM
to vcap...@cloudfoundry.org
On 02/05/12 00:50, Mike Heath wrote:
> If vcap eliminates the use of queue groups, which I think it should,
> creating a clustered NATS server should be simple.

Did you see RATS: https://github.com/majek/rats? No idea if it helps, but RabbitMQ is probably a lot more robust than native NATS. On the other hand, I'm not losing any sleep over NATS myself.

Dave.

--
Dave Syer
ds...@vmware.com

derek

May 2, 2012, 11:15:03 AM
to vcap-dev
I am committed to finishing the clustered version of NATS, and some
future version of NATS will be a core part of Apcera.
The clustered version is almost complete, and only needs retry logic
between servers and client rebalancing. I would encourage you to
experiment with the cluster branch and report back any issues.

=derek

derek

May 2, 2012, 11:18:17 AM
to vcap-dev
Currently the system (vcap) can recover from brief NATS outages. When the system is being updated at the VM level and the NATS process associated with that VM stays down for an extended period of time, clients will error out and, as some have noted, certain system components may exit.

When designing the clustered version, the first issue to solve was availability. Even with quite large systems, a single NATS server has no problems with client connections or message throughput.

That being said, better performance is always a good thing, and I have
been exploring a high performance server component.

=derek

derek

May 2, 2012, 11:20:52 AM
to vcap-dev
The NATS client will (and should) drop messages while re-connecting, but it will re-establish its subscription state. The authentication credentials obviously need to be the same in the current version, but multiple auth configurations are allowed and supported in the clustered version.

The Ruby client should be considered the golden master at this point. If there is an issue with the Node.js client, please let me know.

=derek

derek

May 2, 2012, 11:22:11 AM
to vcap-dev
The client reconnect logic and distributed queues are the most challenging parts. Both are complete in the NATS cluster branch today. I need to finish up the timer logic on reconnects between servers (routes) and rebalancing within the client pool.

=derek

derek

May 2, 2012, 11:24:59 AM
to vcap-dev
One of the big things that people overlook is the visibility that using an mbus provides. Moving pieces to a direct RPC mechanism may make sense, sans the distributed queuing mechanism, but the loss of visibility into the system is what one should be most concerned with. `nats-sub '>'` is a powerful tool: since NATS treats every messaging pattern underneath as pub/sub, and '>' is the match-everything wildcard subject, you can watch the entire system with the aforementioned command.
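
For illustration, the same whole-system view is available from the Ruby client (a sketch; '>' is the wildcard subject that matches everything):

```ruby
require 'nats/client'

# Subscribe to the '>' wildcard to observe every message on the bus,
# similar to running `nats-sub '>'` from the command line.
NATS.start do
  NATS.subscribe('>') do |msg, _reply, subject|
    puts "#{subject}: #{msg}"
  end
end
```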

=derek

derek

May 2, 2012, 11:28:22 AM
to vcap-dev
NATS' primary function is as an always-on, distributed messaging system with very specific patterns not found in other systems. It does not do durability or transactional semantics; it operates, to some degree, like a nervous system. Certain patterns are seen in other systems (pub/sub, queueing, RPC), but some are unique, such as asking a question of N parties (where N can be large) and only taking M answers (where M is small). This can be achieved with other systems, but at a high CPU cost for the requesting client. NATS is able to proactively prune the interest graph via information in the request that indicates how many answers are wanted/expected.
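
A sketch of that ask-N-take-M pattern (here M = 1) with the Ruby client, assuming the request helper's :max option; the subject is invented:

```ruby
require 'nats/client'

NATS.start do
  # Broadcast a question to everyone listening on 'discover.service', but
  # keep only the first answer: the client auto-unsubscribes after :max
  # replies, which lets the server prune interest in the reply subject.
  NATS.request('discover.service', nil, :max => 1) do |response|
    puts "first answer: #{response}"
  end

  # Stop after a couple of seconds whether or not anyone answered.
  EM.add_timer(2) { NATS.stop }
end
```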

Also, see "visibility" in another reply, which is the most often
overlooked strength to this approach.

=derek



derek

May 2, 2012, 11:33:08 AM
to vcap-dev
Also note that there is a NATS Google group (natsio), which I will be checking more frequently.

Matt Page

May 2, 2012, 1:30:22 PM
to vcap...@cloudfoundry.org
Maybe RPC isn't the proper term here. Moving to direct connections for point-to-point requests doesn't imply that each request expects a response. The fire-and-forget nature of a lot of CF's design is wonderful and isn't something that I want to change.

Distributed "queues" can be implemented fairly easily using heartbeats and point-to-point RPC (see the sketch after this list). For example:

1. Each node in the queue publishes a heartbeat that contains a piece of information stating that it belongs to the queue.
2. Nodes that wish to send requests to members of the queue construct a local map of queue members using information contained in the heartbeats.
3. Before sending a request to the queue, the requestor chooses a node uniformly at random from its local view of queue members (from the map in #2) and asks that node to perform the request.
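
A minimal sketch of steps 2 and 3 in plain Ruby (the member-id format, staleness window, and class name are made up for illustration):

```ruby
# Tracks queue members advertised via heartbeats and picks one at random.
class QueueMembers
  STALE_AFTER = 30 # seconds without a heartbeat before a member is dropped

  def initialize
    @last_seen = {} # member id (e.g. "host:port") => Time of last heartbeat
  end

  # Step 2: record membership info carried in each heartbeat.
  def heartbeat(member_id)
    @last_seen[member_id] = Time.now
  end

  # Step 3: choose a live member uniformly at random.
  def pick
    now = Time.now
    @last_seen.delete_if { |_id, seen| now - seen > STALE_AFTER }
    @last_seen.keys.sample
  end
end

members = QueueMembers.new
members.heartbeat('10.0.0.11:9022')
members.heartbeat('10.0.0.12:9022')
target = members.pick
puts "send the request to #{target}" if target
```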

I agree that NATS provides a powerful way to get insight into the system. However, proper component logging and log aggregation bridge the gap well.

Cheers,
Matt

derek

May 2, 2012, 2:36:47 PM
to vcap-dev
Sans the visibility issue, and one of my beliefs that you should never make assumptions about what data can be used for, there are other issues with your approach.

Consider distributed state and connections/communications. In this regard, messaging systems try to optimize for these while giving up some latency (due to multiple hops, etc.).

What I mean is that as your client population grows and becomes larger than your messaging cluster (usually quite small, and for NATS today a singleton), the aggregate size of the distributed state and change management, as well as the number of client connections, grows.

E.g., with a messaging system that has a queue group of 5 providing some service, and a large client population, say 100:

With a messaging system, each member of the queue group has 1 connection to the messaging system, and all 100 clients have connections to the messaging system as well. The messaging system has 105 connections.

With direct RPC, each queue member has 100 connections (x 5 = 500). As you can see, it is linear with respect to the queue group size.

Also, the state of the groups themselves is now duplicated in all 100 clients, versus held in the messaging system itself.

You can tear down connections and only spin them up when needed with direct RPC, but this introduces more latency than in scenario #1, which IMO is the only nominal downside to the architecture.
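
The connection arithmetic above, spelled out with the same illustrative numbers:

```ruby
clients    = 100
group_size = 5

# Hub-and-spoke messaging: every client and every queue-group member holds
# one connection to the messaging system.
messaging_connections = clients + group_size   # => 105

# Direct RPC: every client connects to every queue-group member.
direct_rpc_connections = clients * group_size  # => 500

puts "messaging: #{messaging_connections}, direct RPC: #{direct_rpc_connections}"
```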

=derek