crossbar.io production readiness


paradox7

Dec 23, 2014, 4:21:51 PM12/23/14
to autob...@googlegroups.com
Hi,

We are considering building our system with WAMP (migrating from the good old HTTP stack). While we are excited about the technology, we will need to go live in a few months, so I am wondering about production readiness. A few questions below:

The crossbar.io router is obviously a key piece in a distributed system. In order for it not to be a SPOF (single point of failure), there should be a cross-host cluster that provides auto-recovery/failover, so that when one instance dies, another takes over. I see clustering is indeed mentioned in the architecture document, but I can't find any example of it, and it seems to be marked as a "planned feature" - so I am wondering, is it available today? If not, is there a date for this feature?

If we need to deploy before such cluster technology exists, what is the suggested alternative architecture? Can we utilize an existing load balancer, such as Amazon's Elastic Load Balancer? Basically, deploy many "local crossbar.io/containers" and let the load balancer route the traffic on top? Any information on this front will greatly help us plan our production deployment strategy, short-term and long-term.

Is there any performance info for the router, such as throughput? As far as you know, are there any mid- to large-scale production systems using crossbar.io (or any other WAMP router) today?

Thanks.

    Tobias Oberstein

    Dec 30, 2014, 8:00:16 AM12/30/14
    to autob...@googlegroups.com
    Hi,

    regarding "production readiness":

    You are not alone .. there are multiple people/companies currently investigating/building stuff on top of WAMP / Crossbar.io in private. Some of them don't even want to disclose / talk yet. I guess the current situation (people unsure about the viability of this tech) is unavoidable .. but transient.

    Regarding routing performance: the only hard numbers we currently have are here: http://tavendo.com/blog/post/autobahn-pi-benchmark/
    In general I'd expect a single instance of Crossbar.io (running a router on a single thread) to scale to 100-200k concurrent connections.
    What do you mean by a "mid - large scale production system"? What volume of connections/messages?

    Note that by using worker processes for your WAMP app components (all connected to a single router), you can already scale up/out the _app logic_ today. What you cannot scale up/out yet is the _routing_ core of Crossbar.io itself.
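
    For illustration, a single-node setup along those lines (one router worker plus an app-component worker) might be configured roughly like this. This is only a sketch of a `.crossbar/config.json` in the node config format of the time; `myapp.backend.Backend` and the ports are made-up placeholders:

```json
{
   "controller": {},
   "workers": [
      {
         "type": "router",
         "realms": [
            {
               "name": "realm1",
               "roles": [
                  {
                     "name": "anonymous",
                     "permissions": [
                        {"uri": "*", "publish": true, "subscribe": true,
                         "call": true, "register": true}
                     ]
                  }
               ]
            }
         ],
         "transports": [
            {"type": "websocket", "endpoint": {"type": "tcp", "port": 8080}}
         ]
      },
      {
         "type": "container",
         "components": [
            {
               "type": "class",
               "classname": "myapp.backend.Backend",
               "realm": "realm1",
               "transport": {
                  "type": "websocket",
                  "url": "ws://127.0.0.1:8080/ws",
                  "endpoint": {"type": "tcp", "host": "127.0.0.1", "port": 8080}
               }
            }
         ]
      }
   ]
}
```

    Adding more "container" workers with the same component occupies more cores on the node; components on other hosts simply connect to the router's WebSocket transport on their own, outside node management.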

    This feature (router-to-router clustering/federation) _will_ come to Crossbar.io - it seems a lot of people are waiting for this. It'll arrive "in the coming months". I am sorry I can't be more specific.

    From a practical point of view, here is what you can do today for HA:

    - Have a hot-standby Crossbar.io instance (that is, one already running, but with _no_ clients connected).
    - When the primary fails, fail over to the standby (either using a LB, or by having the standby take over the primary's IP).
    - When the primary fails, all clients (both frontend and backend components) will lose their connections.
    - All clients will (should) automatically reconnect (as e.g. AutobahnJS does), and hence connect to the standby.

    Note that for the above to work, your backend components will need to connect via the LB just like the frontend components, so that the LB can forward the connection to the standby upon reconnection. One issue with this might be frontend components reconnecting faster than backend components (which then won't yet be e.g. callable from frontends).
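
    The reconnect loop this scheme depends on can be sketched in plain Python - a stand-alone illustration of exponential-backoff retry in the spirit of AutobahnJS's auto-reconnect, not the actual AutobahnJS/Crossbar.io API; `try_connect` is a made-up stand-in for opening the WAMP transport:

```python
def reconnect_delays(initial=1.0, factor=1.5, max_delay=300.0, max_retries=10):
    """Yield successive reconnect delays with exponential backoff,
    capped at max_delay, for at most max_retries attempts."""
    delay = initial
    for _ in range(max_retries):
        yield min(delay, max_delay)
        delay *= factor

def connect_with_retry(try_connect, delays):
    """Retry try_connect() until it succeeds or retries are exhausted.
    Returns the number of attempts made, or None if all failed.
    A real client would time.sleep(delay) between attempts."""
    for attempts, delay in enumerate(delays, start=1):
        if try_connect():
            return attempts
    return None
```

    Because every retry goes through the LB (or the taken-over IP), clients end up on the standby without any session state being carried over.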

    Another note: we are not planning to fail over an established WAMP session from Crossbar.io instance 1 to instance 2. This would be really complex, for various reasons. Instead, we will rely on WAMP clients auto-reconnecting - which at least frontend clients will need to do anyway (think mobile networks with intermittent connectivity).

    Please ask again if the above is insufficient info for you to go forward. It's exciting to see more and more people joining in .. I wish we already had a "full story" for all the valid requests people have. Not quite there yet ;)

    Cheers,
    /Tobias

    paradox7

    Dec 30, 2014, 9:47:35 AM12/30/14
    to autob...@googlegroups.com
    Thanks, Tobias, for the detailed reply.


    ... Note that by using worker processes for your WAMP app components (all connected to a single router), you can already scale up/out the _app logic_ today.

    here is a more concrete ask:
    1. scale-out: app logic containers need to be able to run collapsed (same host as the router) as well as on separate hosts. Can crossbar.io manage workers on different hosts?
    2. scale-up: we need multiple instances for each app container, for availability as well as throughput scaling.
      • How do we do this with RPC registration? The router will not allow redundant endpoint registration, will it?
      • How do we do this with subscriptions, so that only 1 process within this "redundant process group" gets to process a given message?
      • Can crossbar.io hold a connection with the container through a container-side proxy (i.e. a load balancer fronting the containers)? Or maybe crossbar.io has a concept of a "worker group", where only 1 instance of such workers will be selected to process the incoming message?

    From a practical point of view, here is what you can do today for HA:

    - Have a hot-standby Crossbar.io instance (that is, one already running, but with _no_ clients connected).
    - When the primary fails, fail over to the standby (either using a LB, or by having the standby take over the primary's IP).
    - When the primary fails, all clients (both frontend and backend components) will lose their connections.
    - All clients will (should) automatically reconnect (as e.g. AutobahnJS does), and hence connect to the standby.

    For the short term, this is acceptable, provided it doesn't happen often ;-)


    Another note: we are not planning to fail over an established WAMP session from Crossbar.io instance 1 to instance 2. This would be really complex, for various reasons. Instead, we will rely on WAMP clients auto-reconnecting - which at least frontend clients will need to do anyway (think mobile networks with intermittent connectivity).

    agree.

     

    Tobias Oberstein

    Dec 30, 2014, 10:30:01 AM12/30/14
    to autob...@googlegroups.com
    Hi,

    > here is more concrete ask:
    >
    > 1. scale-out: app logic containers need to be able to run collapsed
    > (same host as router) as well as in separate hosts. Can crossbar.io
    > manage workers on different hosts?

    A Crossbar.io instance can only start workers locally.

    This allows you to scale up your app logic (that is, run the logic on
    multiple cores of the local machine) by starting more workers managed
    by Crossbar.io.

    However, a Crossbar.io instance can (obviously) accept incoming WAMP
    connections from app components anywhere. This is what allows you to
    scale out your app logic today - run the components on other hosts.

    Those connecting clients won't be managed/monitored by Crossbar.io as
    workers then, though.

    Here is where we want to go:

    Have a cluster/federation of Crossbar.io instances where you can start
    workers transparently on any of the nodes. Or have Crossbar.io
    automatically make placement decisions (like firing up a worker on the
    least loaded host). Or have Crossbar.io fire up a worker in an OS
    container (think Docker). Etc.

    This latter stuff points to an exciting perspective: Crossbar.io as
    a complete microservice platform.

    It would be interesting to me to hear where you actually want to go
    with Crossbar.io in your app/project ...


    > 2. scale-up: need multiple instances for each app container, for
    > availability as well as throughput scaling.
    > * How do we do this with RPC registration? the router will not
    > allow redundant endpoint registration, does it?

    Again, in the pipeline: the WAMP "Advanced Profile" talks about this
    under the term "distributed/partitioned RPC/PubSub".

    Regarding HA-Callees: https://github.com/tavendo/WAMP/issues/89

    > * How do we do this with subscription that only 1 process within
    > this "redundant processes group" get to process the given message.

    Not sure what you mean here .. can you expand on the behavior you envision?

    > * Can crossbar.io hold connection with the container through a
    > container-side proxy (i.e. a load balancer fronting the
    > containers)? Or maybe crossbar.io has a concept of "worker
    > group" that only 1 instance of such workers will be selected to
    > process the incoming message ?

    Kind of the latter. Crossbar.io will implement different policies, like
    round-robin, random, .. for directing e.g. a specific call to the
    respective endpoints (callees). There is no LB involved: Crossbar.io
    itself acts as a WAMP-level LB.
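
    To make that concrete, such WAMP-level callee selection can be sketched as follows - a toy stand-alone model under made-up names, not Crossbar.io's actual implementation:

```python
import itertools
import random

class ProcedureRegistry:
    """Toy model of WAMP-level callee selection: several callees may
    register the same procedure URI, and the router picks one per call
    according to a policy."""

    def __init__(self):
        self._callees = {}  # procedure URI -> list of callees
        self._cycles = {}   # procedure URI -> round-robin iterator

    def register(self, uri, callee):
        self._callees.setdefault(uri, []).append(callee)
        # restart the round-robin cycle so it covers all callees
        self._cycles[uri] = itertools.cycle(self._callees[uri])

    def pick(self, uri, policy="roundrobin"):
        """Select the callee that would receive the next call to `uri`."""
        if policy == "roundrobin":
            return next(self._cycles[uri])
        if policy == "random":
            return random.choice(self._callees[uri])
        raise ValueError("unknown policy: " + policy)
```

    The point being: the balancing decision happens inside the router, at the WAMP layer, so no network-level balancer sits between the router and the callees.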

    >
    >
    > From a practical point of view, here is what you can do today for HA:
    >
    > - Have a hot-standby Crossbar.io instance (that is one already
    > running, but with _no_ clients connected).
    > - When the primary fails, failover to the former (either using a LB,
    > or by taking over the IP of the former)
    > - When the former fails, all clients (both frontend and backend
    > components) will loose their connections
    > - All clients will (should) automatically reconnect (as e.g.
    > AutobahnJS does), and hence connect to the standby
    >
    >
    > For short term, this is acceptable, provided it didn't happen often ;-)

    FWIW, I haven't seen our public Crossbar.io instance (which runs on EC2)
    collapse even once. It's only restarted after Crossbar.io or OS
    upgrades. That's it.

    Cheers,
    /Tobias

    paradox7

    Dec 30, 2014, 4:38:11 PM12/30/14
    to autob...@googlegroups.com
    Yes, it looks like "distributed/partitioned RPC/PubSub" could solve the scalability and availability issues we have. However, the spec is not yet stable/finalized, and I agree the router needs to tackle its own availability and scalability first (clustering, for example) before scaling the workers according to the spec above... I wonder if there is a quick win in leveraging existing, proven infrastructure to scale workers: it would not only ease early adopters' concerns, but also buy precious time/experience for taking crossbar.io to the next level...

    It would be really cool, and would immediately make crossbar.io production-ready for us, if it could communicate with the workers through a LB proxy fronting the app workers. For example, Amazon Elastic Load Balancer (ELB) seems to support WebSocket via its TCP protocol mode (http://blog.flux7.com/web-apps-websockets-with-aws-elastic-load-balancing). If we have 2 app workers behind an ELB, both registered for the same endpoint, crossbar.io will probably reject the 2nd registration attempt, but as long as it routes messages to the ELB based on the first registration, we will be OK, since the ELB will take care of load balancing, distributed process management, etc. It seems crossbar.io should be able to do this today - am I missing something?

    To sum up: we don't mind managing worker scalability ourselves via existing LB technology; we don't mind occasional manual failover of crossbar.io; and we can live with the throughput limit of a single crossbar.io instance for the next few months. If crossbar.io can function under these assumptions, then we have a winner!

    Tobias Oberstein

    Dec 31, 2014, 5:59:58 AM12/31/14
    to autob...@googlegroups.com
    Hi,

    Am 30.12.2014 um 22:38 schrieb paradox7:
    > Yes, looks like "distributed/partitioned RPC/PubSub" could solve the
    > scalability and availability issues we have. However the spec is not yet
    > stable/finalized, and I agree the router needs to tackle its own
    > availability and scalability first (clustering for example) before
    > scaling the workers according to the spec above... I wonder if there

    We need both, yes.

    But they are orthogonal, and the distributed/partitioned stuff is
    definitely easier.

    Allowing Crossbar.io to maintain multiple endpoints for a given
    procedure, and then randomly selecting one when a call comes in, is
    straightforward. I have avoided adding this point feature as a quick
    hack, since I wanted to get the "big picture" conceptually right ..
    which then includes stuff like "partitioned RPC" etc.

    > If we have 2 app workers behind ELB, all registered for the same
    > endpoint, crossbar.io will probably reject the 2nd registration
    > attempt, but as long as it routes messages to ELB based on the first
    > registration, we will be ok since ELB will take care of the
    > load-balancing, distributed process management etc. It seems that
    > crossbar.io should be able to do this today, am I missing something?

    Unfortunately, yes.

    Workers are not _listening_ for WebSocket connections coming in from a
    WAMP router (and possibly distributed by a LB), but _connecting_ over a
    WebSocket connection to a router. And since we don't yet have
    router-to-router clustering, all those workers will need to connect to 1
    router. And that router does not yet allow 2 workers to register the
    same procedure.
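
    In code terms, the current single-callee constraint behaves roughly like this toy sketch (the names are made up; the error URI `wamp.error.procedure_already_exists` is the one WAMP defines for this case):

```python
class SingleCalleeRouter:
    """Toy model of the constraint described above: the first callee to
    register a procedure URI wins, and later registrations are rejected."""

    def __init__(self):
        self._procs = {}  # procedure URI -> the single registered callee

    def register(self, uri, callee):
        if uri in self._procs:
            # compare WAMP's "wamp.error.procedure_already_exists"
            raise ValueError("procedure already registered: " + uri)
        self._procs[uri] = callee

    def lookup(self, uri):
        """Return the callee a call to `uri` would be routed to."""
        return self._procs[uri]
```

    So with 2 identical workers, the second one's registration fails, and all calls keep going to whichever worker registered first.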

    > To sum up, we don't mind managing the workers scalability ourselves via
    > existing LB technology; we don't mind occasional manual fallback with
    > crossbar.io; we can live with single crossbar.io throughput limit for
    > the next few months. if crossbar.io can function under these
    > assumptions, then we have a winner!

    If this is a deal breaker for you ("lack of worker load-balancing" for
    workers registering the same procedure), we might implement that point
    feature sooner - you just need to convince me that your project will be
    awesome and will push Crossbar.io to reprioritize my endless work queue ;)

    Cheers,
    /Tobias

    paradox7

    Jan 28, 2015, 9:07:32 PM1/28/15
    to autob...@googlegroups.com

    Sorry for the long delay...

    If this is a deal breaker for you ("lack of worker load-balancing" for
    workers registering same procedures), we might implement that point
    feature quicker...

    Yes, it will be a deal breaker when we go to production. The infrastructure has to be capable of scaling beyond single-host-per-component. I looked at the new crossbar roadmap: all of the scalability-related features are currently scheduled for release 3 (Aug 2015), which is too late for us. It would be really great if crossbar could take a more incremental approach and implement this low-hanging fruit earlier, so more early adopters can scale up to be great real-world WAMP systems.
