Changes in how a node joins the cluster

Skip to first unread message

Paddy Foran

Jan 19, 2013, 6:58:16 PM1/19/13
As was kindly pointed out to me[1], timestamps in a distributed application are a Bad IdeaTM. At the moment, Wendy uses timestamps to detect race conditions. To quote from the original paper:

Pastry uses an optimistic approach to controlling concurrent node arrivals and departures. Since the arrival/departure of a node affects only a small number of existing nodes in the system, contention is rare and an optimistic approach is appropriate. Briefly, whenever a node A provides state information to a node B, it attaches a timestamp to the message. B adjusts its own state based on this information and eventually sends an update message to A (e.g., notifying A of its arrival). B attaches the original timestamp, which allows A to check if its state has since changed. In the event that its state has changed, it responds with its updated state and B restarts its operation.

Essentially, if two nodes join at the same time, or a node leaves while another node is in the process of joining, the state table of the joining node can get out of sync. Which is bad. To that end, a timestamp is included whenever a node sends another node state, and the receiving node sends that timestamp back after it finishes modifying its state. If the sending node has changed since that timestamp, it sends its new state to the receiving node.

Or at least that's how it's supposed to work--the current implementation is wrong[2].

With the unreliability of timestamps, however, we need a new way to keep track of whether a node's state has changed or not. Vector clocks were recommended to me, in lieu of timestamps, but those are rather complicated and there are several important tradeoffs between retaining information and keeping the memory footprint low that need to be made, and I'm not sure they're worth the level of complexity they add. My current plan is to just give each state table its own version counter that increments whenever the state table changes. That counter would then be embedded in lieu of the timestamp.

Any objections to that simplification? I believe it stores the required information in the simplest possible way to perform the required function.

Another thing I'm looking into is implementing the neighborhood set, as originally mentioned in the paper, and discontinuing the practice of storing every node. As Evan Shaw pointed out, the cost of storing so many nodes gets prohibitive (as can be seen in the benchmarks), and the paper allows to store only about log base 16 of the nodes in the cluster while still functioning normally. This means our primitive implementation of the algorithm for when a node joins the cluster would need to be thrown out, and replaced with the algorithm actually described in the paper, which is significantly more complex. The extra work is, however, rewarded with a piece of software that can trivially handle millions of connected nodes.

I'm currently working on these changes in a branch titled "paddyforan-neighborhood". It will be periodically pushed to Github, in case anyone is interested in checking the progress or doing code review.


Reply all
Reply to author
0 new messages