Simple gossip/discovery protocol frameworks.


Kevin Burton

Oct 28, 2015, 4:46:37 PM
to mechanical-sympathy
I need a simple gossip/discovery protocol framework for an internal project.

I basically want a gossip protocol similar to Dynamo/Cassandra, whereby nodes bootstrap against a known-good set of nodes, then advertise their services, then send heartbeats.

Other nodes can advertise when things are offline or not working.

I could build one on my own, of course, but that takes time.

I think ZooKeeper is out of the question for technical reasons, but it might be my plan B.

Marshall Pierce

Oct 28, 2015, 5:47:43 PM
to mechanica...@googlegroups.com
Something like Consul may fit your needs, and if not, be sure to check out its docs for a comparison to other similar tools. 

Sent from my iPhone

Greg Young

Oct 28, 2015, 6:22:25 PM
to mechanica...@googlegroups.com
If you need a simple gossip protocol, why would you need a framework?

Do you need sharding, etc.?

If not, this is a few hundred lines of code. Explanation:

Configure via DNS: DNS contains a list of seed nodes (as many as you
want). Once you have found one, you exchange your knowledge of the
system over HTTP with a random entry from your existing membership
list, once every 500 ms.
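
Roughly, in Java (just a sketch of the loop above; the seed DNS name, the port, the /gossip endpoint and the host=timestamp payload are made-up placeholders, not part of any existing framework):

import java.net.InetAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.*;
import java.util.concurrent.*;

// Each node keeps a map of everything it has heard of and, every 500 ms,
// pushes that view to one random peer over HTTP; the peer's /gossip handler
// (not shown) would do the symmetric merge and reply with its own view.
public class TinyGossip {
    private final Map<String, Long> view = new ConcurrentHashMap<>(); // peer -> last-heard-of millis
    private final HttpClient http = HttpClient.newHttpClient();
    private final String self;

    TinyGossip(String self, String seedDnsName) throws Exception {
        this.self = self;
        // DNS holds the seed list: one A record per seed node.
        for (InetAddress a : InetAddress.getAllByName(seedDnsName)) {
            view.put(a.getHostAddress(), 0L);
        }
        view.put(self, System.currentTimeMillis());
    }

    void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            view.put(self, System.currentTimeMillis());          // my own heartbeat
            List<String> peers = new ArrayList<>(view.keySet());
            peers.remove(self);
            if (peers.isEmpty()) return;
            String peer = peers.get(ThreadLocalRandom.current().nextInt(peers.size()));
            StringBuilder body = new StringBuilder();
            view.forEach((host, ts) -> body.append(host).append('=').append(ts).append('\n'));
            HttpRequest req = HttpRequest.newBuilder(URI.create("http://" + peer + ":8080/gossip"))
                    .timeout(Duration.ofMillis(400))
                    .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
                    .build();
            try {
                // The peer replies with its view; merge by keeping the newer timestamp.
                String reply = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
                for (String line : reply.split("\n")) {
                    if (line.isBlank()) continue;
                    String[] kv = line.split("=");
                    view.merge(kv[0], Long.parseLong(kv[1]), Math::max);
                }
            } catch (Exception e) {
                // Unreachable peers are expected; failure detection keys off stale timestamps.
            }
        }, 0, 500, TimeUnit.MILLISECONDS);
    }
}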

I'm not sure why I would want a framework for this unless you start
wanting things like optional consistent queries (yes, Paxos is hard).

--
Studying for the Turing test

Ben Alex

Oct 28, 2015, 6:30:19 PM
to mechanica...@googlegroups.com
You mentioned both gossip (which prioritises availability over consistency under the CAP theorem) and ZooKeeper (which prioritises consistency). So for a consistency-requiring use case, take a look at the Copycat introduction for a comparison of several JVM options (Raft-based Copycat, ZooKeeper, Hazelcast). There's also TomP2P for an availability-focused (Kademlia-based) P2P model. See the Jepsen blog for comparisons of several of the products mentioned (including Consul, ZooKeeper and Cassandra) under different network-failure scenarios. Don't roll your own unless you really need to; it's tough to build something that properly handles all the different failure and conflict-resolution conditions.

Brian Toal

Oct 28, 2015, 6:36:19 PM
to mechanica...@googlegroups.com

See https://raft.github.io/ for an explanation of the consensus protocol and a list of available implementations.

Greg Young

Oct 28, 2015, 6:37:09 PM
to mechanica...@googlegroups.com
Sorry, but is this about gossip or consensus?

Shevek

Oct 28, 2015, 6:58:51 PM
to mechanica...@googlegroups.com
Consider JGroups. Old, but it works just great, and the protocol is
configurable to give the properties you require in a given application,
rather than imposing what it thinks you need.

As a contributor to and previously heavy user of Curator, I wish
ZooKeeper worked, but it doesn't. Avoid. I rather suspect it's dead
anyway. Copycat looks promising, but may be too strong for your needs.

S.

Kevin Burton

Oct 28, 2015, 8:09:20 PM
to mechanical-sympathy
There seem to be a lot of issues with this system.

Are you sending ALL messages to every host?  That's not very efficient.  Additionally, random() is probably NOT what you want.

It's within the realm of probability to continually lose every roll of the dice and leave nodes out of the loop.

If I were to roll something myself, I would use bootstrap nodes, then build a ring and only advertise to nodes with a value less than my own.
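
(A rough sketch of that ring idea, purely illustrative; hashing node ids onto the ring and advertising only to the nearest lower position is just one way to read "less than my own".)

import java.util.*;

// Hash every node id onto a ring; a node pushes its advertisements only to the
// peer sitting at the nearest position below its own, wrapping at the bottom.
final class Ring {
    private final NavigableMap<Integer, String> ring = new TreeMap<>();

    void add(String nodeId) {
        ring.put(nodeId.hashCode() & 0x7fffffff, nodeId); // mask sign bit: non-negative positions
    }

    /** The peer this node would advertise to: the nearest position below its own. */
    String predecessorOf(String nodeId) {
        int pos = nodeId.hashCode() & 0x7fffffff;
        Map.Entry<Integer, String> lower = ring.lowerEntry(pos);
        if (lower == null) lower = ring.lastEntry();      // wrap around the ring
        return lower.getValue().equals(nodeId) ? null : lower.getValue();
    }
}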

But again, I don't necessarily want to build it.

Greg Young

Oct 28, 2015, 8:13:37 PM
to mechanica...@googlegroups.com
"Are you sending ALL message to every host? That's not very
efficient. Additionally, random() is probably NOT what you want."

google gossip protocol please.

Greg Young

Oct 28, 2015, 8:14:10 PM
to mechanica...@googlegroups.com
And yes, a random one is what you want on a network that's fairly small
and homogeneous.

Greg Young

Oct 28, 2015, 8:26:45 PM
to mechanica...@googlegroups.com
Re-reading your initial post makes me think ^^ this even more.

The fact that you consider a gossip protocol and ZooKeeper to be
equivalent makes me think you should explain the problem you are
trying to solve in more detail.


singh.janmejay

Oct 29, 2015, 2:39:33 AM
to mechanica...@googlegroups.com

Shevek, can you please elaborate a little more on why you say ZooKeeper and Curator won't work (assuming the use case requires consistency)?

I will have a consistency use case for cluster state for a project in the near future (as in, all nodes should agree on what the cluster state is, and on the sequence of events that got it there). I was thinking of ZooKeeper as a good choice for that (as a store for cluster state and for leader election, so the leader can then manipulate or act upon the cluster state).

Kevin, if you are looking for consistency, why is ZooKeeper not an option (a quick peek into the technical reasons would be very helpful)?

--
Regards,
Janmejay

PS: Please blame the typos in this mail on my phone's uncivilized soft keyboard sporting its not-so-smart-assist technology.
   


Richard Warburton

Oct 29, 2015, 6:34:20 AM
to mechanica...@googlegroups.com
Hi,

> I think ZooKeeper is out of the question for technical reasons, but it might be my plan B.

What are the technical reasons? I ask because several other suggestions in this thread aren't ZooKeeper but use it under the hood, so they might have similar problems for you.

regards,

  Richard Warburton

Luiz Fernando Teston

Oct 29, 2015, 6:51:28 AM
to mechanica...@googlegroups.com
Just giving my two cents here, I'd go for JGroups given its simplicity.

Richard Warburton

Oct 29, 2015, 7:07:11 AM
to mechanica...@googlegroups.com
Hi,

> As a contributor to and previously heavy user of Curator, I wish ZooKeeper worked, but it doesn't. Avoid. I rather suspect it's dead anyway. Copycat looks promising, but may be too strong for your needs.

ZooKeeper doesn't look dead to me. Can you expand on "I wish ZooKeeper worked, but it doesn't"? No system is a panacea; nothing works for everyone. What didn't work for you, or is your issue bugs?

james bedenbaugh

Oct 29, 2015, 11:42:22 AM
to mechanical-sympathy
+1 jgroups

Josh Humphries

Oct 29, 2015, 12:01:30 PM
to mechanica...@googlegroups.com
ZooKeeper is definitely not dead, and is quite good for this sort of use.

The main complaint people have about using ZooKeeper is its strong consistency model. Typically, eventual consistency suffices for a service discovery system, and strong consistency means reduced availability in the face of a partition.

However, there is support in 3.4 for a "read-only mode" (marked "experimental", but it's been out and stable for a while). If the server and client are both configured to support read-only mode, then the server will continue to serve read traffic, even if disconnected from the quorum.
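
For concreteness, roughly how that is switched on: server side it's the readonlymode.enabled Java system property on each ZK server (-Dreadonlymode.enabled=true), and client side the last constructor argument opts the session into read-only fallback. A minimal sketch (the ensemble host names are placeholders):

import java.io.IOException;
import org.apache.zookeeper.ZooKeeper;

class ReadOnlyClientExample {
    static ZooKeeper connect() throws IOException {
        return new ZooKeeper(
                "zk1:2181,zk2:2181,zk3:2181",   // placeholder ensemble
                30_000,                         // session timeout (ms)
                event -> System.out.println("state: " + event.getState()), // ConnectedReadOnly shows up here
                true);                          // canBeReadOnly: accept read-only service during a partition
    }
}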



----
Josh Humphries
Manager, Shared Systems  |  Platform Engineering
Atlanta, GA  |  678-400-4867


Shevek

Oct 29, 2015, 3:14:25 PM
to mechanica...@googlegroups.com
The objective for any system is to create a "probability of system
success". While ZooKeeper claims a "guarantee" of consistency, the
failure rate of ZK in practical deployment made the overall probability
of system success considerably lower than with simpler "non-guaranteed"
solutions.

This is all based on 3 years of personal experience, with Java, C and
Python clients, so it's not very reference-able, 12 months out of date, and
rant-y.

ZooKeeper takes a LOT of care and feeding.

If you're not heavily over-provisioned, then a glitch on any node, e.g.
an I/O scheduler hiccup, will cause connection timeouts, which then fail
over, and will cascade across the cluster and can throw the entire
system offline. The necessary overprovisioning is expensive.

If I remember rightly, the connections didn't always time out, block, or
throw as desired, which meant that one gets the wrong answer for e.g. a
database write vs a leader election. ZK doesn't have enough information
to make these decisions, and tends to stop YOUR world whenever it gets
it wrong. One can wrap all ZK calls in Hystrix, of course, but then one
has prevented the local client from breaking like crazy in return for
throwing "consistency" largely out of the window.

A large company (Pinterest?) produced a paper saying that in order to
run ZK for service discovery, they put a proxy in front of it to deal
with downtime, at which point the purpose of using ZK for consistency
is defeated, and one might as well use DNS.

Modifying the server cluster is very painful, and despite claims to the
contrary in ZK 3.5.x, remains a stop-the-world event. It's still hard to
bootstrap, especially compared to C*, and requires another layer of
(manual) turtles to configure the servers.

ZK rolled their own network code, which works just fine, honest.

When trying to create "correctness", there is some ambiguity as to which
exceptions mean 'retry'. For example, delete throwing a NoNodeException is
NOT a retry, but delete throwing some other exception might need a retry;
that other exception does NOT mean the node wasn't deleted, though, so your
retry might get a different exception.
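
To make that concrete, a sketch of the ambiguity (not a recommended recipe):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class DeleteRetryAmbiguity {
    // A delete that dies with ConnectionLoss may or may not have been applied,
    // so a blind retry can then see NoNodeException even though "our" delete won.
    static void deleteWithRetry(ZooKeeper zk, String path) throws Exception {
        while (true) {
            try {
                zk.delete(path, -1);                // -1 = any version
                return;
            } catch (KeeperException.NoNodeException e) {
                return;                             // gone: not a retry, but was it us or someone else?
            } catch (KeeperException.ConnectionLossException e) {
                // Outcome unknown: the delete may have landed server-side.
                // Loop and try again, accepting the ambiguity above.
            }
        }
    }
}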

When anything breaks in ZK, you're also lost in a maze of twisty NIH
code which doesn't log useful diagnostics, and any bug in a ZK setup
costs a week to diagnose. The build system comes from the stone age, and
the distributions don't help the matter. On the whole, when ZK broke, I
didn't blame any of the engineers who were "washing their hair",
"hoovering the cat" or doing anything else more important.

ZK has watches, which generate event delivery, but given that client
disconnects happen, almost every client polls as well as listening to
watches, at which point one might as well, again, lose the complexity of
ZK and just make a best-effort solution plus polling.

Using ZK sufficiently far under the hood that the majority of change
transactions don't go through ZK often works, but then I have to wonder
whether ZK is still offering any value, or whether it's just a "feel-good"
factor that the system is mathematically correct.

Using Cassandra (with Astyanax locking or queues), JGroups, or Hazelcast
works much better in practice. I have great hopes for Copycat, but they
require Java 8, at no obviously significant advantage to themselves, at
a point when none of our customers have deployed it. Of them all, I
think I would go to JGroups first, as it has the most configurability
for any particular application, thus making it less likely that one has
to replace the coordination stack entirely. If C* had watches, I would
be in heaven, although C* is fast enough that a single "something has
changed" signal often suffices in practice.

I never plan to use ZooKeeper ever again.

S.


Josh Humphries

Oct 30, 2015, 11:12:09 AM
to mechanica...@googlegroups.com
My team and I have three years of experience operating over a dozen Zookeeper ensembles in production. My own anecdotes differ dramatically.

We've had zero effective downtime in that entire three-year period. You didn't state whether you were using a cloud-hosting provider, but that could be a possible issue. ZK, especially in its default/typical configuration, is sensitive to latency between ensemble members. I've heard of the kind of instability you mention, for example, when the cluster is on VMs in EC2. The clusters my team operates are running on our own metal (using Linux cgroups for isolation), so we may enjoy performance characteristics that are just more amenable to ZK.

We currently run 3.4.6 in production, but 3.5.1 in dev. (3.5.1 isn't yet stable, but it has a bug-fix to a bad NPE issue that has only been observed on development laptops; we think the issue is related to the computer going to sleep and ZK getting into a confused state on wake-up.)

Moving the cluster is not a stop-the-world event for us, and was not even as far back as 3.3.6. We do it somewhat regularly (several times per year) and with zero downtime. It does require a careful process, involving updates to round-robin DNS configuration as well as a deploy of updated configuration for the ZK servers. And if you are moving a quorum (or more) of hosts, you will have to restart clients in the middle of the process. This latter issue is a bug in the client library -- once it resolves DNS, it never tries to re-resolve (until restart). Admittedly, having to restart clients is terrible, but in practice we've never actually had to move that many nodes at once so have never had to take those measures.

As far as availability vs. consistency, see my previous message about "read-only mode" (added as an experimental feature in 3.4). To be honest, we don't actually run using read-only mode for our service discovery. We do plan on using that feature since we would prefer availability over consistency (eventual consistency suffices). But making that config change has been low priority because we've never had an incident where the standard configuration was a problem and a cluster couldn't reach quorum (knocking on wood...).

Regarding errors and how to know when to retry in the client: the ZK client library is indeed a big leaky abstraction. Getting the client implementation correct and robust in the face of every kind of failure scenario isn't easy. However, Curator implements several recipes that provide better abstractions, including handling errors in a sensible and correct way. These recipes are great building blocks. Recipes like NodeCache and TreeCache handle retrying operations until they eventually succeed and also properly handle session events and watches, most importantly: re-creating watches on session expiry.
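
For example (a sketch only; the connect string and path are placeholders), a NodeCache keeps a local copy of one znode, retries internally, and re-creates its watch across disconnects and session expiry:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class DiscoveryWatch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",           // placeholder ensemble
                new ExponentialBackoffRetry(1000, 3));  // base sleep 1s, max 3 retries
        client.start();

        NodeCache cache = new NodeCache(client, "/services/frontends");
        cache.getListenable().addListener(() -> {
            byte[] data = cache.getCurrentData() == null
                    ? new byte[0] : cache.getCurrentData().getData();
            System.out.println("membership changed: " + new String(data));
        });
        cache.start(true);   // true = prime the cache with the current data
    }
}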

As far as watches vs. polling: client disconnects do not mean clients need to resort to polling. You only need to query when you get an event indicating that your session expired (like during a prolonged disconnection), just as if the watch had actually triggered. You can configure the session expiry so that watches survive intermittent disconnects. Depending on how your service discovery works, there is some tension on configuring the session timeout. You want it large enough so that sessions aren't expired frequently in the course of typical operations (e.g. even if a client experiences a full GC). That is at odds with the use of ephemeral nodes, where you generally want the timeout small so that ephemeral nodes are cleaned up in a timely manner. Our service discovery implementation doesn't use ephemeral nodes; we use a session expiry of 30 seconds.



----
Josh Humphries
Manager, Shared Systems  |  Platform Engineering
Atlanta, GA  |  678-400-4867
