On 7 Feb 2013, at 13:26, Alberto G. Corona wrote:
> I have an application in which the ordering of messages is important. I know that among two nodes, the receiver will receive all the messages and will receive them ordered.
>
The ordering guarantee is between two *processes* rather than between nodes. If P1 sends to P2, then all messages will either be delivered in order, or not at all. If P1 and P2 are on different nodes and the network becomes disconnected, then P2 has to explicitly `reconnect' to receive further messages, and in doing so acknowledges the possibility of lost messages and/or lost ordering.
> But what happens when two or more nodes are sending messages to a central node, where the outcome depends on the order (and even on the timing) of the messages? Think, for example, of a reverse auction service where the first bid gets the lot. Imagine that there are N bidders on N nodes. (In reverse auctions, many bids can be sent in less than a second.)
>
> The question is: do the messages arrive ordered by a timestamp put on them in the source node?
>
No. Ordering between more than two processes - P1 *and* P2 both sending to P3 - is undefined, even (or perhaps, especially!) if all three processes reside on the same node, let alone three different ones! This is exactly how Erlang does it, except that we do a bit better, because Erlang *can* sometimes break the ordering guarantee between two (peer) processes in the face of network outages and fast automatic reconnects.
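To make the guarantee concrete, here is a minimal, hedged sketch in plain Haskell (not using the distributed-process API): the only invariant a receiver can rely on is that the delivered sequence, restricted to any one sender, matches that sender's send order. The `Msg` type and `respectsFifo` predicate are illustrative names of mine, not library functions.

```haskell
import Data.List (isSubsequenceOf)

-- A message tagged with its sender.
data Msg = Msg { sender :: String, payload :: Int } deriving (Eq, Show)

-- The only guarantee: the delivery order at P3, filtered down to any one
-- sender, matches the order that sender used. Nothing constrains how the
-- two per-sender streams interleave.
respectsFifo :: [Msg] -> [Msg] -> [Msg] -> Bool
respectsFifo fromP1 fromP2 delivered =
     fromP1 `isSubsequenceOf` delivered
  && fromP2 `isSubsequenceOf` delivered

main :: IO ()
main = do
  let p1 = [Msg "P1" 1, Msg "P1" 2]
      p2 = [Msg "P2" 1, Msg "P2" 2]
  -- Both of these interleavings are legal deliveries at P3:
  print (respectsFifo p1 p2 [Msg "P1" 1, Msg "P2" 1, Msg "P1" 2, Msg "P2" 2])
  print (respectsFifo p1 p2 [Msg "P2" 1, Msg "P2" 2, Msg "P1" 1, Msg "P1" 2])
```

Both calls print `True`: many interleavings satisfy the per-sender FIFO property, which is precisely why the ordering across multiple senders is undefined.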
> If not, could it be implemented in the future as an improvement - a "transactional service", for example?
>
Not at the distributed-process library level, no. The overhead of synchronising a group of N distributed senders to guarantee total ordering in a single receiver is quite a bit higher than you might imagine.
Having said that....
> This is also important for the implementation of synchronization, failover, clustering, and distributed databases, and for other kinds of programming paradigms, like event sourcing.
>
Indeed this is a useful thing to have, but as I said, this is not something that the distributed-process layer should be doing. Erlang doesn't do this, but distributed databases written in Erlang, such as Riak, add this layer on top using vector clocks. I'd be happy to integrate this capability into Cloud Haskell's distributed-process-platform as an optional feature. Pull requests are most welcome! ;)
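For the curious, a minimal sketch of what such a vector-clock layer tracks, written in plain Haskell over `Data.Map` - the function names (`tick`, `merge`, `happenedBefore`, `concurrent`) are mine, not from Riak or any Cloud Haskell package:

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

type NodeId = String
type VClock = Map NodeId Int

-- Bump this node's own entry before sending an event.
tick :: NodeId -> VClock -> VClock
tick n = Map.insertWith (+) n 1

-- On receipt, take the pointwise maximum of the two clocks.
merge :: VClock -> VClock -> VClock
merge = Map.unionWith max

-- a `happenedBefore` b iff every entry in a is <= the matching entry
-- in b (missing entries count as 0) and the clocks are not equal.
happenedBefore :: VClock -> VClock -> Bool
happenedBefore a b =
  a /= b && all (\(n, c) -> c <= Map.findWithDefault 0 n b) (Map.toList a)

-- Neither ordered: the events are concurrent, and the application (or a
-- Riak-style sibling-resolution step) must decide what to do.
concurrent :: VClock -> VClock -> Bool
concurrent a b = not (happenedBefore a b) && not (happenedBefore b a)
```

The `concurrent` case is the crux: a vector clock detects that two updates raced, but resolving the race is still left to the layer above.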
> Scenario 1
> Imagine that I want one or more mirror nodes, to be used just in case of failure of the main node. To do so, I forward the messages to the mirror nodes, so that their state stays synchronized with the main node.
>
> Scenario 2
> Instead of active-inactive synchronization for failover, the mirror nodes can be active, so they receive requests simultaneously from a load-distribution service, each one having a mirrored state - for example, a distributed database of books at Amazon or, more critically, some bank accounts. Then the messages must be transmitted and re-transmitted among the cluster respecting the original order in which the clients sent them.
>
Both of these require a lot more than just augmenting the node receiver queue with ordering guarantees. See for example, https://github.com/rabbitmq/rabbitmq-server/blob/master/src/gm.erl and then the essay near the top of https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_mirror_queue_coordinator.erl#L50.
Incidentally, do you know of any existing messaging infrastructure technologies that *do* offer ordering guarantees that hold over multiple distributed senders? I work on messaging technology for a living (at https://rabbitmq.com) and I've not come across this. You can guarantee the ordering between two endpoints in a messaging infrastructure, no more. If you want to order particular clients with respect to timestamps, then you've got to implement that yourself in the processing nodes outside the messaging backbone.
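As a sketch of what "implement that yourself in the processing nodes" might look like: stamp each message at its source and impose a deterministic total order in the receiver, breaking timestamp ties by sender id so every replica derives the same order. This is illustrative code of my own, and it assumes source clocks are roughly synchronised - a real deployment would need to account for clock skew (which is exactly where vector clocks come back in).

```haskell
import Data.List (sortOn)

-- A message stamped at its source. Ties on the timestamp are broken by
-- the sender id, so every node computes the same total order.
data Stamped = Stamped
  { ts   :: Int     -- source timestamp (assumes roughly synchronised clocks)
  , src  :: String  -- sender id, used as a deterministic tie-breaker
  , body :: String
  } deriving (Eq, Show)

-- Deterministic total order over a batch of messages, applied in the
-- processing node, outside the messaging backbone.
totalOrder :: [Stamped] -> [Stamped]
totalOrder = sortOn (\m -> (ts m, src m))
```

Note that this only orders messages the node has already buffered; deciding when a batch is complete (i.e., that no earlier-stamped message is still in flight) is the genuinely hard part.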
> I think that it does make sense to provide this service, covering these scenarios, at the framework level rather than at the application level.
>
I agree and disagree at the same time! Ha ha ha. :D
This kind of thing *does* belong in a framework, but not in the base-level Cloud Haskell library. To do this properly requires vector clocks, and the cost of doing that is actually very high. Not every application needs this, so putting it into the base Cloud Haskell layer not only complicates the semantics significantly, but also has a terrible effect on performance for those that do not need such strong guarantees.
Doing this at a level above distributed-process is fine though - see Jeff's https://github.com/jepst/distributed-process-global for example, which offers cluster control and global locking. But if you go ahead and put global locks around all send operations throughout your cluster, do please let me know how it performs at runtime - my expectation is that you'll have terrible throughput.
> I'm going too fast. Maybe there is something already developed for this, or a simpler, standardized solution.
>
I won't pretend to be a distributed systems expert - although Edsko probably does count as one of those - but in my somewhat limited experience in this area over the years, I've found that problems such as this are often underestimated and assumed to be easy to solve. A vector clock *will* solve some of these problems, but it introduces others, as it facilitates Availability and Partition tolerance but not Consistency. Consistency requires global synchronisation, which implies transactions, which implies something a la Paxos and friends. We have an open issue to implement distributed transactions in distributed-process-platform already: https://cloud-haskell.atlassian.net/browse/DPP-40 - we do not have an open issue to figure out how to do global/distributed deadlock detection, but then Erlang's gen_leader hasn't solved that problem yet either. And as I said, making 'send' a global transaction is not likely to yield very nice throughput.
Cheers,
Tim
Alberto,

It sounds like an interesting design, though I'm unsure what the formal semantics would look like. I'm not aware, for example, of any leaderless transaction management protocols - can you point me at some literature here? I'd be interested to learn.

Without synchronisation between peers, i.e., transactions, I cannot see how this would work without a broadcast protocol. As I say, I've not heard of any transaction management protocols that don't require a coordinator.

With a directed graph arranged as a ring overlay, I imagine that you could achieve lock-less ordering guarantees amongst all peers at the cost of 2*n hops per message, as we do in gm. The cost ends up being state transfer in either case.

I'd be delighted to see something that adds new distributed algorithms on top of Cloud Haskell, of course. We have numerous open issues that require coordination amongst peers already: process groups, mirrored supervisors, and group services (all filed against d-p-platform) all require this. I am planning to use distributed-process-global for this, but I'm unable to run the test suite on OSX and haven't had time to figure out why.

Of course, if you do implement this, the ordering will have to be enforced by an insulating process paired with the actual receiver. It might be worth adding this as a layer on top of the ManagedProcess API. The way that is evolving, you have one or more behaviours which are layered together to form a ProcessDefinition that defines the handlers for different kinds of messages. The order of the handler declarations determines the processing order once messages arrive, as these are passed to a selective receive. If your layer provides the ordering checks transparently, then it could simply be layered in as a behaviour. The handlers could either deal with coordination directly or cooperate with a registered process.
Or you could use an insulator process, of course. That won't work until 0.5.0 is released, however, or until the outstanding pull request is merged into the development branch of d-p, as the behaviour API depends on Message being serializable and consumable by user code.

Cheers,
Tim
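For what the core of such an insulating process might look like, here is a transport-agnostic resequencing sketch in plain Haskell. Every name here (`Seqd`, `Insulator`, `deliver`) is my own invention, not part of any Cloud Haskell API: each sender numbers its messages 1, 2, 3, ..., and the insulator releases messages to the real receiver in sequence, parking early arrivals until the gap before them is filled.

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

type Sender = String

-- A message tagged with its sender and a per-sender sequence
-- number (1, 2, 3, ...) assigned at the source.
data Seqd a = Seqd { sFrom :: Sender, sNum :: Int, sBody :: a }

-- Insulator state: the next expected number per sender, plus any
-- early arrivals parked until their predecessors show up.
data Insulator a = Insulator
  { nextExpected :: Map Sender Int
  , parked       :: Map (Sender, Int) a
  }

emptyInsulator :: Insulator a
emptyInsulator = Insulator Map.empty Map.empty

-- Feed one arrival; return the messages now releasable, in order.
deliver :: Seqd a -> Insulator a -> ([a], Insulator a)
deliver (Seqd who n x) ins@(Insulator nm pm)
  | n == want = drain who (Insulator (Map.insert who (n + 1) nm) pm) [x]
  | n > want  = ([], ins { parked = Map.insert (who, n) x pm })
  | otherwise = ([], ins)  -- stale or duplicate: drop it
  where want = Map.findWithDefault 1 who nm

-- After releasing one message, keep releasing any parked successors.
drain :: Sender -> Insulator a -> [a] -> ([a], Insulator a)
drain who ins@(Insulator nm pm) acc =
  let want = Map.findWithDefault 1 who nm
  in case Map.lookup (who, want) pm of
       Nothing -> (reverse acc, ins)
       Just x  -> drain who
                    (Insulator (Map.insert who (want + 1) nm)
                               (Map.delete (who, want) pm))
                    (x : acc)
```

Note that this only restores per-sender order - exactly the guarantee discussed above - and says nothing about ordering *across* senders, which is where the vector clocks and coordination come in.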