I think the first two questions pretty well covered the extent of message reliability in Vertigo, but I have a lot more to say...
I think you've actually brought up some really interesting scenarios that are obviously extremely relevant to the real world, and I'd like to take some time discussing my thoughts on these issues as well as some of the history and future of the project.
While Vertigo is sorely lacking in many areas of reliability, reliability and fault-tolerance are clearly huge and complex problems in distributed systems, and I've been working on these features for Vertigo for almost a year now. In fact, when Vertigo was originally built, I implemented an algorithm basically equivalent to the
Storm algorithm, but as Vertigo began to mature and I got some real world experience with the Storm algorithm, I honestly began to completely dislike it. I now view Storm and the Storm acking algorithm as basically lazy in that it delegates much of the hard work to external systems. In my experience, crossing technological boundaries in reliability makes for significantly more complexity. So, with the collaboration of several other Vert.x community members, Vertigo was ultimately refactored to simplify the messaging framework and adhere to a Flow-Based Programming model.
While I am quite satisfied with the directional shift in Vertigo's development, the lack of more advanced fault-tolerance and reliability features has always been an obvious drawback. In my view, Vertigo is essentially useless in a distributed capacity until these types of features can be provided, and that's why I've spent much of this year studying and implementing high level fault-tolerance and messaging features for Vertigo 1.0.
The current Vertigo implementation does indeed provide some level of fault-tolerance for components. However, I believe it is absolutely insufficient for this use case. Essentially, the current implementation closely mimics the behavior of Vert.x's HA feature. When a node fails, the node membership event triggers a redeployment of that node's components on the nearest neighbor. However, I take issue with this HA implementation both in Vert.x and in Vertigo, and here's why...
Both Vert.x and Vertigo rely on Hazelcast for cluster membership detection by default. But the issue is that Hazelcast is an AP system, and it naively expects that every node in the Hazelcast cluster can communicate with every other node. That is, Hazelcast's cluster membership list is maintained by each node individually sending heartbeats to each other node in the cluster. If a network partition occurs, the local Hazelcast instance will trigger a cluster leave event, resulting in both Vert.x and Vertigo redeploying modules from the disconnected node. This could cause unexpected behavior for normal Vert.x components depending on their implementation, but it will most definitely have negative implications for Vertigo components. For instance, such a network partition would cause Vertigo to deploy two instances of the same instance of the same component. Where this could be an issue is with hash-based stream partitioning. If two physical instances of the same logical instance of the same component are deployed in the same cluster at the same time, messages sent to that component will be split among the two real instances. Additionally, as you can probably deduce from my description of the acking algorithm above, message counts will be inconsistent and all messages will fail and be resent. That is, the first instance will receive messages 1, 3, 5, 7, and 9, while the second instance will receive 2, 4, 6, 8, and 10. This would result in every message being failed since the receiving side of the connection would assume that messages were lost. But I digress.
The issue of fault-tolerance for Vertigo components is one I've thought about a lot, and it's why I originally got interested in the
Raft consensus algorithm. Really, the Vertigo cluster needs a strongly consistent, partition tolerant coordination framework in order to guarantee that these types of inconsistencies cannot occur, and that's why I've been working on
my own Raft implementation which integrates with Vert.x for some time now. In fact, as the scope of both Copycat and Vertigo has grown, and through many long discussions with other Vert.x community members, Raft and Copycat will likely become a much larger part of Vertigo for Vert.x 3, but more on that later.
The other glaring issue with Vertigo - as you pointed out - is reliable messages. While, as I mentioned, I did dislike the Storm acking algorithm, that doesn't mean that Vertigo can't have reliable messaging at all. In fact, since my time experimenting with Storm and its algorithm, I've gained significant interest in
Akka's approach to reliability. In short, what's currently being developed for Vertigo 1.0 is much more inline with how persistent messaging is handled in Akka, with some additional features.
I've mentioned a little bit about the future of Vertigo, but now is where I give the full overview. With the release of Vert.x 3 upcoming, we decided to target a huge development push and release of Vertigo 1.0 at around the same time frame. Vert.x 3 resolves a lot of the issues that made development of earlier versions of Vertigo difficult - particularly with regard to polyglot development. Vert.x 3 will bring us code generation for all Vert.x supported languages, and that allows for much more development time to be spent on Vertigo itself.
Up until now I've mostly been working on Copycat and assisting with the development of code generation support in Vert.x core, so there's still a lot to be done, but here are a few of the plans for Vertigo 1.0. Additional features slated for Vertigo 1.0 include request-reply ports, consistent distributed data structures and cluster coordination, event sourced components, persistent replicated messaging, strongly consistent fault/partition tolerance, and fault-tolerant replicated state machines as components. Of course, each of these features will be configurable on a per-network/per-component/per-connection basis. Ultimately, you can say this amounts to at-least-once processing (exactly-once processing is a fantasy).
In summary, don't use Vertigo for reliability right now, but rest assured that it is being worked on constantly!
Thanks for your interest!
Jordan Halterman