Vertigo and Fault Tolerance

Pablo Alcaraz

Dec 23, 2014, 5:07:28 PM
to vertx-...@googlegroups.com
Hello

I have been reading all the documentation carefully, and I have a question: how does Vertigo handle fault tolerance scenarios? We are evaluating Vertigo for our project. Everyone at the company likes it, but we need to know what happens when:

a) A sender server fails (failure = the server using Vertigo services hangs for any external reason).
1) What happens to messages that have been sent and are still being processed by the receiver?

b) A receiver component fails.
1) Say we have 2 receivers with the same topic/name and one suddenly fails. What happens to the message being processed by that receiver? Is it resent to the other listener? Will it be delivered when the component reconnects? Is it lost?

c) A group of senders and receivers fail.
1) What happens to the messages being processed? Are they resent to other listeners?

d) All the servers using the Vertigo network fail.
1) Say all the servers fail (perhaps a big network failure), they disconnect, and the Vertigo cluster is destroyed. What happens to the messages? Are they resent when the Vertigo cluster recovers?
2) Another scenario: Vertigo's participants die for external reasons (a sysadmin kills all the processes, a power failure, etc.). What happens to messages that have been sent but not yet received? Are they recovered? And the same for messages received by listeners but still being processed: are they resent?

Pablo

Jordan Halterman

Dec 23, 2014, 10:30:40 PM
to vertx-...@googlegroups.com
Hi, thanks for asking. I love talking about this type of stuff!

"How does Vertigo deal with Fault Tolerance scenarios?"

The short answer to this question is:
Vertigo's fault tolerance is not much different from Vert.x's: it will ensure that networks and their components remain running as long as some portion of the cluster is running, and while it will attempt to assist in the delivery of messages, it does not guarantee that a message will get from point A to point B once, many times, or ever.

Let me first respond to your specific scenarios and then I'll explain a bit about the past, present, and future of fault tolerance in Vertigo.

a) A sender server fails (failure = the server using Vertigo services hangs for any external reason).
1) What happens to messages that have been sent and are still being processed by the receiver?

The message will still be successfully processed. In the current Vertigo implementation, it's up to the receiver to indicate to the sender if a message is received out of order or not received at all. It accomplishes this by internally ensuring that each communication channel (connection) between each Vertigo component is represented by a single event bus address. Each output message for each individual connection is tagged with a monotonically increasing number. Since the Vert.x event bus delivers messages in order, the counter sent along with each message can be used by the receiving side of the connection to efficiently determine whether any messages were lost during transmission. For example, if the receiving side of a connection sees messages 100, 101, 102, and then 104, it knows that message 103 was lost and requests that the sending side of the connection resend the missing message. This allows messages to be effectively acked between a single sender and receiver without the expensive overhead of request-reply communication.
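The gap-detection scheme described above can be sketched in a few lines. This is a minimal illustration, not the actual Vertigo implementation or API; the class and method names are mine:

```java
// Sketch of per-connection gap detection: every message on a connection
// carries a monotonically increasing sequence number, and the receiver
// flags any gap so the sender can be asked to resend.
import java.util.ArrayList;
import java.util.List;

public class GapDetector {
    private long expected = 1;                  // next sequence number expected
    private final List<Long> missing = new ArrayList<>();

    /** Process one incoming sequence number; returns true if it was in order. */
    public boolean receive(long seq) {
        if (seq == expected) {
            expected++;
            return true;
        }
        if (seq > expected) {
            // Sequence numbers expected..seq-1 were lost in transit; a real
            // implementation would request that the sender resend them.
            for (long s = expected; s < seq; s++) {
                missing.add(s);
            }
            expected = seq + 1;
        }
        return false;
    }

    public List<Long> missing() {
        return missing;
    }

    public static void main(String[] args) {
        GapDetector rx = new GapDetector();
        for (long seq : new long[] {1, 2, 3, 5}) {
            rx.receive(seq);
        }
        System.out.println(rx.missing()); // [4]
    }
}
```

Because the Vert.x event bus preserves ordering per address, a single counter comparison per message is all the receiver needs, which is what avoids the request-reply round trip per message.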

However, note that while the receiving side of the connection will successfully process any messages it has in memory, it has no way of knowing that the sender is dead. Once the sender comes back online, though, it will send a new connect message to the receiving side of the connection and begin resending messages from 1.

b) A receiver component fails.
1) Say we have 2 receivers with the same topic/name and one suddenly fails. What happens to the message being processed by that receiver? Is it resent to the other listener? Will it be delivered when the component reconnects? Is it lost?

The sending side of a connection makes an effort only to ensure that messages arrive in order, and some effort - albeit weak - to redeliver lost messages. Indeed, these types of scenarios are the reason Vertigo cannot be relied upon to guarantee message delivery. The sender queues messages in memory until they're acked by the receiver, but when a receiver fails and is either redeployed on another node or restarted on its original node, acking essentially restarts from the point at which the restart/redeploy occurred. In other words, the current implementation behaves as if messages sent while the receiver was offline were successfully received. The reason is that tracking acks across receiver failures is an expensive operation, and without also replicating or persisting the messages themselves, the receiver's ability to recover its point in the message stream would still be vulnerable to a simultaneous failure of both sender and receiver. That is, even if the messages sent while the receiving side of the connection was offline could be replayed, a failure on both sides of the connection would still result in message loss as long as messages are held only in memory.
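The sender-side behavior just described can be sketched as follows. The names are hypothetical, not the Vertigo API; the point is that the unacked backlog lives only in memory and is simply dropped when the receiver restarts:

```java
// Sketch of the sender side of a connection: unacked messages are held
// only in memory, and a receiver restart clears the backlog as if it had
// been delivered - which is exactly where messages can be lost.
import java.util.LinkedHashMap;
import java.util.Map;

public class SendQueue {
    private long nextSeq = 1;
    private final Map<Long, String> unacked = new LinkedHashMap<>();

    /** Send a message: queue it until the receiver acks it. */
    public long send(String body) {
        long seq = nextSeq++;
        unacked.put(seq, body);     // in memory only; lost if this JVM dies
        return seq;
    }

    /** The receiver acknowledged everything up to and including seq. */
    public void ack(long seq) {
        unacked.keySet().removeIf(s -> s <= seq);
    }

    /** The receiver restarted: acking resumes from the restart point, so
        the queued backlog is discarded rather than replayed. */
    public void receiverRestarted() {
        unacked.clear();
    }

    public int pending() {
        return unacked.size();
    }

    public static void main(String[] args) {
        SendQueue tx = new SendQueue();
        tx.send("a"); tx.send("b"); tx.send("c");
        tx.ack(2);
        System.out.println(tx.pending()); // 1
        tx.receiverRestarted();
        System.out.println(tx.pending()); // 0
    }
}
```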

c) A group of senders and receivers fail.
1) What happens to the messages being processed? Are they resent to other listeners?

As you can probably gather from my previous answers, messages will be lost during this period of failure. However, since you asked about redelivery to other listeners, I'd like to touch on some of the challenges in this area. Because of Vertigo's hash-based routing support, there is additional complexity in simply routing failed messages to other component instances. If the user is using the HashSelector to select connections for a specific stream, that strongly implies that messages *must* be sent to the selected connection and only that connection. For instance, if a component is counting users by user ID and the stream is partitioned by user ID, a message that hashes to connection 1 obviously cannot be rerouted to connection 2 because of a failure.
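A toy version of hash-based connection selection makes the constraint concrete. This is illustrative only, not the real HashSelector:

```java
// Toy hash-based connection selection: the same key always maps to the
// same connection index, so state partitioned by that key (e.g. per-user
// counts) lives on exactly one instance. That is why messages for a
// failed connection cannot simply be rerouted to another one.
public class HashSelectorDemo {
    /** Pick a connection index for a key among n connections. */
    static int select(Object key, int connections) {
        return Math.floorMod(key.hashCode(), connections);
    }

    public static void main(String[] args) {
        int first = select("user-42", 2);
        int again = select("user-42", 2);
        // Deterministic: "user-42" always lands on the same connection.
        System.out.println(first == again); // true
    }
}
```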

Coincidentally, the same issue arises when scaling individual components. Ideally, it would be nice to simply update a network and add instances to a component in order to scale without shutting the network down. However, if a 4-instance component that receives messages based on consistent hashing is scaled to 6 instances, suddenly the hash is not consistent at all.

Of course, these problems are not completely unsolvable. Vertigo could support consistent scaling by, for instance, scaling a 4-instance component to 8 instances, and that's certainly on the to-do list for Vertigo 1.0.
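A quick worked example of why doubling (rather than, say, going 4 to 6) keeps hash routing manageable, assuming simple modulo partitioning: a key at index p among 4 instances lands at either p or p + 4 among 8, so each old instance's state splits cleanly between exactly two new instances.

```java
// Demonstrates that with modulo partitioning, doubling the instance count
// moves every key from old index p to either p or p + 4 - never anywhere
// else - so state can be split deterministically on scale-up.
public class DoublingDemo {
    static int partition(int hash, int instances) {
        return Math.floorMod(hash, instances);
    }

    public static void main(String[] args) {
        boolean clean = true;
        for (int hash = 0; hash < 10_000; hash++) {
            int oldP = partition(hash, 4);
            int newP = partition(hash, 8);
            if (newP != oldP && newP != oldP + 4) {
                clean = false;
            }
        }
        System.out.println(clean); // true
    }
}
```

With a 4-to-6 scale-up, by contrast, a key's new index bears no such fixed relationship to its old one, which is what makes arbitrary rescaling hard.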

d) All the servers using the Vertigo network fail.
1) Say all the servers fail (perhaps a big network failure), they disconnect, and the Vertigo cluster is destroyed. What happens to the messages? Are they resent when the Vertigo cluster recovers?
2) Another scenario: Vertigo's participants die for external reasons (a sysadmin kills all the processes, a power failure, etc.). What happens to messages that have been sent but not yet received? Are they recovered? And the same for messages received by listeners but still being processed: are they resent?

I think the first two questions pretty well covered the extent of message reliability in Vertigo, but I have a lot more to say...

I think you've actually brought up some really interesting scenarios that are obviously extremely relevant to the real world, and I'd like to take some time discussing my thoughts on these issues as well as some of the history and future of the project.

While Vertigo is sorely lacking in many areas of reliability, reliability and fault-tolerance are clearly huge and complex problems in distributed systems, and I've been working on these features for Vertigo for almost a year now. In fact, when Vertigo was originally built, I implemented an algorithm basically equivalent to the Storm algorithm, but as Vertigo began to mature and I got some real world experience with the Storm algorithm, I honestly began to completely dislike it. I now view Storm and the Storm acking algorithm as basically lazy in that it delegates much of the hard work to external systems. In my experience, crossing technological boundaries in reliability makes for significantly more complexity. So, with the collaboration of several other Vert.x community members, Vertigo was ultimately refactored to simplify the messaging framework and adhere to a Flow-Based Programming model.

While I am quite satisfied with the directional shift in Vertigo's development, the lack of more advanced fault-tolerance and reliability features has always been an obvious drawback. In my view, Vertigo is essentially useless in a distributed capacity until these types of features can be provided, and that's why I've spent much of this year studying and implementing high level fault-tolerance and messaging features for Vertigo 1.0.

The current Vertigo implementation does indeed provide some level of fault-tolerance for components. However, I believe it is absolutely insufficient for this use case. Essentially, the current implementation closely mimics the behavior of Vert.x's HA feature. When a node fails, the node membership event triggers a redeployment of that node's components on the nearest neighbor. However, I take issue with this HA implementation both in Vert.x and in Vertigo, and here's why...

Both Vert.x and Vertigo rely on Hazelcast for cluster membership detection by default. The issue is that Hazelcast is an AP system, and it naively expects that every node in the Hazelcast cluster can communicate with every other node. That is, Hazelcast's cluster membership list is maintained by each node individually sending heartbeats to every other node in the cluster. If a network partition occurs, the local Hazelcast instance will trigger a cluster leave event, resulting in both Vert.x and Vertigo redeploying modules from the disconnected node. This could cause unexpected behavior for normal Vert.x components depending on their implementation, but it will most definitely have negative implications for Vertigo components. For instance, such a network partition would cause Vertigo to deploy two physical instances of the same logical instance of a component.

Where this becomes an issue is with hash-based stream partitioning. If two physical instances of the same logical instance of a component are deployed in the same cluster at the same time, messages sent to that component will be split between the two real instances. Additionally, as you can probably deduce from my description of the acking algorithm above, message counts will be inconsistent and all messages will fail and be resent. That is, the first instance will receive messages 1, 3, 5, 7, and 9, while the second instance will receive 2, 4, 6, 8, and 10. This would result in every message being failed, since the receiving side of the connection would assume that messages were lost. But I digress.
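The split-brain failure mode above is easy to simulate. This sketch (mine, not Vertigo code) splits a sequence-numbered stream round-robin between two receivers, as would happen with two physical copies of one logical instance, and shows that each one flags every message it did not see as lost:

```java
// Simulates the split-brain scenario: two physical copies of one logical
// instance each receive every other message, so both perceive nothing
// but gaps in the per-connection sequence numbers.
import java.util.ArrayList;
import java.util.List;

public class SplitBrainDemo {
    /** Sequence numbers a receiver would flag as lost, given what it saw. */
    static List<Long> gaps(List<Long> seen) {
        List<Long> missing = new ArrayList<>();
        long expected = 1;
        for (long seq : seen) {
            for (long s = expected; s < seq; s++) {
                missing.add(s);
            }
            expected = seq + 1;
        }
        return missing;
    }

    public static void main(String[] args) {
        List<Long> first = List.of(1L, 3L, 5L, 7L, 9L);   // one physical instance
        List<Long> second = List.of(2L, 4L, 6L, 8L, 10L); // the other
        System.out.println(gaps(first));  // [2, 4, 6, 8]
        System.out.println(gaps(second)); // [1, 3, 5, 7, 9]
    }
}
```

Every message gets failed and resent by one side or the other, even though all ten were actually delivered somewhere.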

The issue of fault tolerance for Vertigo components is one I've thought about a lot, and it's why I originally got interested in the Raft consensus algorithm. Really, the Vertigo cluster needs a strongly consistent, partition-tolerant coordination framework in order to guarantee that these types of inconsistencies cannot occur, and that's why I've been working for some time on Copycat, my own Raft implementation that integrates with Vert.x. In fact, as the scope of both Copycat and Vertigo has grown, and through many long discussions with other Vert.x community members, Raft and Copycat will likely become a much larger part of Vertigo for Vert.x 3 - but more on that later.

The other glaring issue with Vertigo - as you pointed out - is reliable messaging. While, as I mentioned, I disliked the Storm acking algorithm, that doesn't mean Vertigo can't have reliable messaging at all. In fact, since my time experimenting with Storm and its algorithm, I've gained significant interest in Akka's approach to reliability. In short, what's currently being developed for Vertigo 1.0 is much more in line with how persistent messaging is handled in Akka, with some additional features.

I've mentioned a little bit about the future of Vertigo, but now is where I give the full overview. With the release of Vert.x 3 upcoming, we decided to target a huge development push and a release of Vertigo 1.0 in the same time frame. Vert.x 3 resolves a lot of the issues that made development of earlier versions of Vertigo difficult - particularly with regard to polyglot development. Vert.x 3 will bring us code generation for all Vert.x supported languages, which frees up much more development time to be spent on Vertigo itself.

Up until now I've mostly been working on Copycat and assisting with the development of code generation support in Vert.x core, so there's still a lot to be done, but here are a few of the plans for Vertigo 1.0: request-reply ports, consistent distributed data structures and cluster coordination, event-sourced components, persistent replicated messaging, strongly consistent fault/partition tolerance, and fault-tolerant replicated state machines as components. Each of these features will be configurable on a per-network, per-component, or per-connection basis. Ultimately, this amounts to at-least-once processing (exactly-once processing is a fantasy).

In summary, don't use Vertigo for reliability right now, but rest assured that it is being worked on constantly!

Thanks for your interest!

Jordan Halterman


--
You received this message because you are subscribed to the Google Groups "vertigo" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vertx-vertig...@googlegroups.com.
To post to this group, send email to vertx-...@googlegroups.com.
Visit this group at http://groups.google.com/group/vertx-vertigo.
To view this discussion on the web visit https://groups.google.com/d/msgid/vertx-vertigo/97520e2b-394e-44da-808c-aaddf73b5553%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pablo Alcaraz

Dec 24, 2014, 3:53:48 PM
to vertx-...@googlegroups.com

Thanks for such a detailed answer. Much appreciated.

Pablo

Jordan Halterman

Dec 26, 2014, 2:47:33 PM
to vertx-...@googlegroups.com
Sorry, I just saw this. Google marked it as spam for some reason, and I only just got the notification and cleared it.

Anyway, no problem! I'm always available for low-level details and philosophical discussions :-)

Sent from my iPhone
