The objective for any system is to maximize the "probability of system
success". While ZooKeeper claims a "guarantee" of consistency, the
failure rate of ZK in practical deployment made the overall probability
of system success considerably lower than that of simpler
"non-guaranteed" solutions.
This is all based on 3 years of personal experience with Java, C and
Python clients, so it's not very reference-able, 12 months out of date,
and rant-y.
ZooKeeper takes a LOT of care and feeding.
If you're not heavily over-provisioned, then a glitch on any node, e.g.
an I/O scheduler hiccup, will cause connection timeouts, which trigger
failovers, which cascade across the cluster and can throw the entire
system offline. The necessary over-provisioning is expensive.
If I remember rightly, the connections didn't always time out, block, or
throw as desired, which meant that one got the wrong answer for, e.g., a
database write vs a leader election. ZK doesn't have enough information
to make these decisions, and tends to stop YOUR world whenever it gets
it wrong. One can wrap all ZK calls in Hystrix, of course, but then one
has merely prevented the local client from breaking like crazy, in
return for throwing "consistency" largely out of the window.
A large company (Pinterest?) published a paper saying that in order to
run ZK for service discovery, they put a proxy in front of it to deal
with downtime, at which point the purpose of using ZK for consistency
is defeated, and one might as well use DNS.
Modifying the server cluster is very painful, and despite claims to the
contrary in ZK 3.5.x, remains a stop-the-world event. It's still hard to
bootstrap, especially compared to C*, and requires another layer of
(manual) turtles to configure the servers.
ZK rolled their own network code, which works just fine, honest.
When trying to create "correctness", there is some ambiguity as to which
exceptions mean 'retry'. e.g. Delete throwing a NoNodeException is NOT a
retry, but delete throwing some other Exception might need a retry, but
does NOT mean that the node wasn't deleted, so your retry might get a
different exception.
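A rough sketch of what the resulting retry logic tends to look like for
delete() (the loop structure and the version of -1 are mine, for
illustration only):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public final class ZkDelete {
    // Best-effort "delete and make sure it's gone". Note how the
    // exception taxonomy forces us to treat NoNodeException as success,
    // because a previous attempt may have deleted the node even though
    // it threw ConnectionLossException.
    public static void deleteIdempotently(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        while (true) {
            try {
                zk.delete(path, -1);    // -1 = any version
                return;                 // definitely deleted
            } catch (KeeperException.NoNodeException e) {
                return;                 // already gone: NOT a retry
            } catch (KeeperException.ConnectionLossException e) {
                // Unknown outcome: the delete may or may not have happened.
                // Retrying may now throw NoNodeException instead.
            }
        }
    }
}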
When anything breaks in ZK, you're also lost in a maze of twisty NIH
code which doesn't log useful diagnostics, and any bug in a ZK setup
costs a week to diagnose. The build system comes from the stone age, and
the distributions don't help the matter. On the whole, when ZK broke, I
didn't blame any of the engineers who were "washing their hair",
"hoovering the cat" or doing anything else more important.
ZK has watches, which deliver change events, but given that client
disconnects happen (and watches are one-shot, so a change can be missed
across a reconnect), almost every client polls as well as listening to
watches, at which point one might as well, again, lose the complexity of
ZK and just use a best-effort solution plus polling.
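The belt-and-braces pattern everyone ends up writing looks roughly like
this (path, interval and error handling are illustrative): set a watch,
re-register it on every event, and poll anyway.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchPlusPoll {
    private final ScheduledExecutorService poller =
            Executors.newSingleThreadScheduledExecutor();

    public void start(final ZooKeeper zk, final String path, final Runnable onChange) {
        final Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                onChange.run();
                rewatch(zk, path, this);    // watches are one-shot; re-register
            }
        };
        rewatch(zk, path, watcher);
        // ...and poll anyway, because a watch can be lost across a disconnect.
        poller.scheduleAtFixedRate(onChange, 30, 30, TimeUnit.SECONDS);
    }

    private void rewatch(ZooKeeper zk, String path, Watcher watcher) {
        try {
            zk.getData(path, watcher, null);
        } catch (Exception e) {
            // Swallowing here is exactly the "best effort" described above.
        }
    }
}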
Using ZK sufficiently far under the hood that the majority of change
transactions don't go through ZK often works, but then I have to wonder
whether ZK is still offering any value, or whether it's just a
"feel-good" factor that the system is mathematically correct.
Using Cassandra (with Astyanax locking or queues), JGroups, or Hazelcast
works much better in practice. I have great hopes for Copycat, but it
requires Java 8, at no obviously significant advantage to itself, at a
point when none of our customers have deployed Java 8. Of them all, I
think I would go to JGroups first, as it has the most configurability
for any particular application, making it less likely that one has to
replace the coordination stack entirely. If C* had watches, I would be
in heaven, although C* is fast enough that a single "something has
changed" signal often suffices in practice.
I never plan to use ZooKeeper ever again.
S.
On 10/28/2015 11:39 PM, singh.janmejay wrote:
> Shevek, can you please elaborate a little more on why you say ZooKeeper
> and Curator won't work (assuming the use case requires consistency)?
>
> I will have a consistency use case for cluster state for a project in the
> near future (as in, all nodes should agree upon what the cluster state is,
> and the sequence of events that got it to what it is). I was thinking of
> ZooKeeper as a good choice for that (as a store for cluster state and for
> leader election, so the leader can then manipulate or act upon cluster state).
>
> Kevin, if you are looking for consistency, why is ZooKeeper not an
> option (a quick peek into the technical reasons would be very helpful)?
>
> --
> Regards,
> Janmejay
>
> PS: Please blame the typos in this mail on my phone's uncivilized soft
> keyboard sporting its not-so-smart-assist technology.
>
> On Oct 29, 2015 4:28 AM, "Shevek" <goo...@anarres.org> wrote: