Replace ZooKeeper with Consul

803 views
Skip to first unread message

Roman Leventov

unread,
Nov 14, 2017, 4:43:06 PM11/14/17
to druid-de...@googlegroups.com
Reviving this issue: https://github.com/druid-io/druid/issues/1263

Consul seems to be superb to ZooKeeper in many aspects: https://www.consul.io/intro/vs/zookeeper.html

We could also leverage it's key-value store for indexing locks.

Gian Merlino

unread,
Nov 14, 2017, 8:40:51 PM11/14/17
to druid-de...@googlegroups.com
I had always hoped that one day we could use Raft on the Druid Coordinators, maybe through something like Copycat: http://atomix.io/copycat/

Then there won't be a dependency on an external system, and we could roll in the metadata store too, with some benefits:

- No dependency on ZK/Consul/Etcd or a metadata database.
- Eliminates a class of bugs where updates to the metadata store are made from an overlord / coordinator that is not actually the leader (due to race between ZK-based leadership election and metadata writes).
- Eliminates need for overlord / coordinator to cache metadata items like segments, active tasks, task locks, etc in memory; since we can structure the Raft-based state store to itself be the cache.

However I would also be supportive of any work to generalize Druid's coordination layer to support using either Zk or Consul. That same work could be used to help build a builtin implementation later.

Gian

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CAB5L%3DwfzZ3f9EkWHFZdOfKfpZ9xyRBKPmirezu7XGki_UjUYPA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Roman Leventov

unread,
Nov 14, 2017, 10:40:30 PM11/14/17
to druid-de...@googlegroups.com
Does Copycat suit for Coordinator and/or Overlord state management, locks and leader election, or for service discovery as well?

Also a nice thing about Consul is that it has Web UI out of the box.

On Tue, Nov 14, 2017 at 10:40 PM, Gian Merlino <gi...@imply.io> wrote:
I had always hoped that one day we could use Raft on the Druid Coordinators, maybe through something like Copycat: http://atomix.io/copycat/

Then there won't be a dependency on an external system, and we could roll in the metadata store too, with some benefits:

- No dependency on ZK/Consul/Etcd or a metadata database.
- Eliminates a class of bugs where updates to the metadata store are made from an overlord / coordinator that is not actually the leader (due to race between ZK-based leadership election and metadata writes).
- Eliminates need for overlord / coordinator to cache metadata items like segments, active tasks, task locks, etc in memory; since we can structure the Raft-based state store to itself be the cache.

However I would also be supportive of any work to generalize Druid's coordination layer to support using either Zk or Consul. That same work could be used to help build a builtin implementation later.

Gian

On Tue, Nov 14, 2017 at 1:43 PM, Roman Leventov <roman.leventov@metamarkets.com> wrote:
Reviving this issue: https://github.com/druid-io/druid/issues/1263

Consul seems to be superb to ZooKeeper in many aspects: https://www.consul.io/intro/vs/zookeeper.html

We could also leverage it's key-value store for indexing locks.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CAB5L%3DwfzZ3f9EkWHFZdOfKfpZ9xyRBKPmirezu7XGki_UjUYPA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.

Himanshu Gupta

unread,
Nov 14, 2017, 11:15:24 PM11/14/17
to Druid Development
I have been working slowly to reduce the dependency of Druid on consensus based system in general to bare minimum so that consensus implementation is used *only* for node discovery and leader election.
My goal is that it should be made possible to write extensions to use the consensus implementation of your choice by implementing just two abstractions https://github.com/druid-io/druid/blob/master/server/src/main/java/io/druid/discovery/DruidLeaderSelector.java and https://github.com/druid-io/druid/blob/master/server/src/main/java/io/druid/discovery/DruidNodeDiscoveryProvider.java .

I am creating alternatives for other state management that original had hard dependencies on zookeeper. All the alternatives use a combination of mentioned abstractions for discovering nodes and then use HTTP to interact directly with other nodes instead of depending on zookeeper or anything like that.
- Segment Load/Drop Management at Coordinator ( DONE: See https://github.com/druid-io/druid/pull/4997 )
- Segment Discovery at Coordinator/Broker ( DONE: See https://github.com/druid-io/druid/pull/4997 )
- Node discovery e.g. Router discovering Brokers, Coordinator discovering lookup nodes etc etc has been moved to use abstractions mentioned before. (DONE)
- Overlord task management ( Proposal: https://github.com/druid-io/druid/issues/4996 , WIP PR would be available soon)

All the code that has hard dependency on Zookeeper (aside from two abstractions I mentioned) is deprecated, being kept only for backward compatibility and till above alternatives prove their worth in production. Note that I'm running all of the above alternatives on our internal metrics cluster except for overlord task management which is a work in progress at this time.

Also, I have already added curator based implementations in Druid core for above abstractions and working on copycat based extension. I'm hopeful to run our internal metrics cluster without zookeeper or any other external consensus implementation dependency by the end of Q1-2018 .

At this point, Aside from node discovery and leader election, I would be reluctant to add more dependency on consensus based system so that writing extensions for consensus system of your choice stays very simple and also no persistent state needs to be stored there. So, We should try to find alternatives like we've done for segment/task management instead of adding more dependency on consensus protocol.

Personally, I have never heard metadata store being a problem except for marketing of Druid. However, I think it would be possible to even remove metadata store by just depending on node discovery and leader election from consensus system.

-- Himanshu

Himanshu Gupta

unread,
Nov 14, 2017, 11:30:38 PM11/14/17
to Druid Development
Forgot to mention another abstraction that would need to be implemented in the extension: https://github.com/druid-io/druid/blob/master/server/src/main/java/io/druid/discovery/DruidNodeAnnouncer.java

Gian Merlino

unread,
Nov 15, 2017, 1:27:56 AM11/15/17
to druid-de...@googlegroups.com
I'm not too familiar with Copycat, it's just something I found when searching for "java raft library". It seems to be embeddable, so it could run on the coordinators rather than as a separate service.

Since Raft is just a protocol for consensus agreement on replicated state, it would work with any state machine that behaves deterministically. So the state machine could be something ZK-like, or key/value, or even a Derby database. I'm not sure if Copycat in particular supports arbitrary state machines, but I would hope that it does. I would imagine using it for locks, leader election, service discovery, and metadata -- allowing us to merge ZK, metadata store, and coordinator into one.

Gian

On Tue, Nov 14, 2017 at 7:40 PM, Roman Leventov <roman.l...@metamarkets.com> wrote:
Does Copycat suit for Coordinator and/or Overlord state management, locks and leader election, or for service discovery as well?

Also a nice thing about Consul is that it has Web UI out of the box.
On Tue, Nov 14, 2017 at 10:40 PM, Gian Merlino <gi...@imply.io> wrote:
I had always hoped that one day we could use Raft on the Druid Coordinators, maybe through something like Copycat: http://atomix.io/copycat/

Then there won't be a dependency on an external system, and we could roll in the metadata store too, with some benefits:

- No dependency on ZK/Consul/Etcd or a metadata database.
- Eliminates a class of bugs where updates to the metadata store are made from an overlord / coordinator that is not actually the leader (due to race between ZK-based leadership election and metadata writes).
- Eliminates need for overlord / coordinator to cache metadata items like segments, active tasks, task locks, etc in memory; since we can structure the Raft-based state store to itself be the cache.

However I would also be supportive of any work to generalize Druid's coordination layer to support using either Zk or Consul. That same work could be used to help build a builtin implementation later.

Gian

On Tue, Nov 14, 2017 at 1:43 PM, Roman Leventov <roman.l...@metamarkets.com> wrote:
Reviving this issue: https://github.com/druid-io/druid/issues/1263

Consul seems to be superb to ZooKeeper in many aspects: https://www.consul.io/intro/vs/zookeeper.html

We could also leverage it's key-value store for indexing locks.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CAB5L%3DwfzZ3f9EkWHFZdOfKfpZ9xyRBKPmirezu7XGki_UjUYPA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CACZNdYBcDmosSJa2QC-Q9YGvALbOmFVHok2mk%2BQ9zt_Ns0%3DjTA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.

Gian Merlino

unread,
Nov 15, 2017, 1:40:18 AM11/15/17
to druid-de...@googlegroups.com
Himanshu- thanks for laying out your plans in one spot. It's helpful. I didn't realize you were already working on something using copycat. Have you worked on it enough to know if it's "nice" (whatever that means -- performance, reliability, ease of integration with Druid)?

Personally, I have never heard metadata store being a problem except for marketing of Druid.

Marketing is important too! But there's also a technical benefit, elimination of race conditions and caching overheads between metadata storage and ZK. In particular I think TaskLockbox, TaskQueue, TaskStorage could be simplified substantially if they didn't have to try to worry about keeping their in-memory data and the metadata storage data in sync. Probably TaskRunner too.

Gian

On Tue, Nov 14, 2017 at 8:15 PM, Himanshu Gupta <g.him...@gmail.com> wrote:
I have been working slowly to reduce the dependency of Druid on consensus based system in general to bare minimum so that consensus implementation is used *only* for node discovery and leader election.
My goal is that it should be made possible to write extensions to use the consensus implementation of your choice by implementing just two abstractions https://github.com/druid-io/druid/blob/master/server/src/main/java/io/druid/discovery/DruidLeaderSelector.java and https://github.com/druid-io/druid/blob/master/server/src/main/java/io/druid/discovery/DruidNodeDiscoveryProvider.java .

I am creating alternatives for other state management that original had hard dependencies on zookeeper. All the alternatives use a combination of mentioned abstractions for discovering nodes and then use HTTP to interact directly with other nodes instead of depending on zookeeper or anything like that.
- Segment Load/Drop Management at Coordinator ( DONE: See https://github.com/druid-io/druid/pull/4997 )
- Segment Discovery at Coordinator/Broker ( DONE: See https://github.com/druid-io/druid/pull/4997 )
- Node discovery e.g. Router discovering Brokers, Coordinator discovering lookup nodes etc etc has been moved to use abstractions mentioned before. (DONE)
- Overlord task management ( Proposal: https://github.com/druid-io/druid/issues/4996 , WIP PR would be available soon)

All the code that has hard dependency on Zookeeper (aside from two abstractions I mentioned) is deprecated, being kept only for backward compatibility and till above alternatives prove their worth in production. Note that I'm running all of the above alternatives on our internal metrics cluster except for overlord task management which is a work in progress at this time.

Also, I have already added curator based implementations in Druid core for above abstractions and working on copycat based extension. I'm hopeful to run our internal metrics cluster without zookeeper or any other external consensus implementation dependency by the end of Q1-2018 .

At this point, Aside from node discovery and leader election, I would be reluctant to add more dependency on consensus based system so that writing extensions for consensus system of your choice stays very simple and also no persistent state needs to be stored there. So, We should try to find alternatives like we've done for segment/task management instead of adding more dependency on consensus protocol.

Personally, I have never heard metadata store being a problem except for marketing of Druid. However, I think it would be possible to even remove metadata store by just depending on node discovery and leader election from consensus system.

-- Himanshu




On Tuesday, 14 November 2017 21:40:30 UTC-6, Roman Leventov wrote:
Does Copycat suit for Coordinator and/or Overlord state management, locks and leader election, or for service discovery as well?

Also a nice thing about Consul is that it has Web UI out of the box.
On Tue, Nov 14, 2017 at 10:40 PM, Gian Merlino <gi...@imply.io> wrote:
I had always hoped that one day we could use Raft on the Druid Coordinators, maybe through something like Copycat: http://atomix.io/copycat/

Then there won't be a dependency on an external system, and we could roll in the metadata store too, with some benefits:

- No dependency on ZK/Consul/Etcd or a metadata database.
- Eliminates a class of bugs where updates to the metadata store are made from an overlord / coordinator that is not actually the leader (due to race between ZK-based leadership election and metadata writes).
- Eliminates need for overlord / coordinator to cache metadata items like segments, active tasks, task locks, etc in memory; since we can structure the Raft-based state store to itself be the cache.

However I would also be supportive of any work to generalize Druid's coordination layer to support using either Zk or Consul. That same work could be used to help build a builtin implementation later.

Gian

On Tue, Nov 14, 2017 at 1:43 PM, Roman Leventov <roman.l...@metamarkets.com> wrote:
Reviving this issue: https://github.com/druid-io/druid/issues/1263

Consul seems to be superb to ZooKeeper in many aspects: https://www.consul.io/intro/vs/zookeeper.html

We could also leverage it's key-value store for indexing locks.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CAB5L%3DwfzZ3f9EkWHFZdOfKfpZ9xyRBKPmirezu7XGki_UjUYPA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CACZNdYBcDmosSJa2QC-Q9YGvALbOmFVHok2mk%2BQ9zt_Ns0%3DjTA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.

Himanshu Gupta

unread,
Nov 15, 2017, 2:42:46 PM11/15/17
to Druid Development
Copycat ( http://atomix.io/ ) has more docs about it but "copycat" is a java implementation of RAFT and provides StateMachine abstraction to manage one replicated log across nodes participating in raft protocol. Such replicated log can be used to facilitate maintaining a consistent state across those machine as long as your StateMachine is deterministic. Major advantage is that it is an embeddable java library and doesn't require users to operate/deploy yet another external dependency. I'm planning to use it to make coordinators be the RAFT servers that would maintain the discovery related state.
On a sidenote , there is another library called "atomix" which uses "copycat" and provides more higher level abstractions like "Group Membership", "Distributed Locks" etc. In my experiments I found that it was simpler and more efficient to write our own state machine directly using copycat instead of depending on higher level abstractions.
Pros: RAFT algorithm is relatively easier to understand, Java Library, Embeddable, We can reduce some layers of abstraction to keep debugging rather simpler, Users don't need to operate cluster of another dependency
Cons: Code is mature and actively maintained since last 4-5 years I think but it hasn't been used in as many production environments as the other options mentioned later.

Consul ( https://www.consul.io/ )  - I haven't used it but, from a cursory look, it isn't embeddable and users need to operate a "consul cluster". It uses the protocol called "SWIM" ( https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf ) .
Pros: Probably more mature and battle tested
Cons: Not a java library, does not use RAFT which has been adopted in many more systems and probably battle tested in many different use cases

etcd ( https://coreos.com/etcd/ ) - Not mentioned in this thread yet but another player in same field, looks similar to Consul. It uses RAFT algorithm so likely similar to copycat at a fundamental level. I found a feature comparison page ( https://coreos.com/etcd/docs/latest/learning/why.html ) too.
Pros: Probably more mature and battle tested, Uses RAFT
Cons: Not a java library

As you see there are many options (including good old Curator/Zookeeper). We are better off not adding a hard dependency on any of those and I'm expecting state of the art to improve with many more options in future. That is the main reason for my approach of making "discovery" pluggable in Druid. Also, it has additional benefit of letting users keep using Zookeeper if they are happy with it.

I'm planning to start with copycat as it looks simplest and most straightforward to me. Initially I thought of writing my own java implementation of RAFT but decided to give copycat a try first. Given that all state being stored is volatile, with full cluster restart, I can easily switch back to Zookeeper (or Consul, Etcd etc) if copycat does not work out in the end or doesn't perform well enough.

Now, on the topic of removing dependency on relational databases. I know marketing is very important and there are possible technical benefits too, but I'm conservative about storing persistent state on consensus based systems at this time and trust relation databases more. Bias comes mainly from maturity of the solution and readily available expertise e.g most companies already have some use of relational database in their tech stack anyway, know how to deploy and operate it, know kind of hardware that is required for reliable storage, how to keep backups etc etc.

Maybe it is a passing feeling and later I will be all in for moving all persistent state to copycat/consul/etcd :) .

That said, I am definitely not against it and suggest being a bit careful about how we make that transition. The way forward should be to adjust metadata store related interfaces in ways that allow creation of metadata store extension on top of consul/etcd/copycat in addition to relational databases. So, concept of pluggable "metadata store" would still stay but users should be able to choose a system of their choice including mysql. In that world I could use copycat for discovery but mysql for metadata store and someone else could use consul for both discovery and metadata store.

-- Himanshu

Himanshu Gupta

unread,
Nov 15, 2017, 4:40:40 PM11/15/17
to Druid Development
> .. I didn't realize you were already working on something using copycat. Have you worked on it enough to know if it's "nice" (whatever that means -- performance, reliability, ease of integration with Druid)?

I did write a prototype implementation to run Druid cluster with copycat being used for discovery a while back , deployed it on a test environment and it worked. Integration with druid would not be difficult at least for the discovery interfaces that I mentioned before. I'm not sure how reliable it is till we actually deploy it at a larger scale on our metrics cluster. However, I did speak to the author of library a while back on gitter and he said that they were already using it in https://onosproject.org/ at scale and confident that it should work well.

Integration for other things might be a bit difficult compared to other alternatives as it offers replicated log for replicating the "commands" on all participating nodes and your state machine has to implement the local storage(possibly using derby for persistent state)  while other solutions offer key-value stores.

-- Himanshu

Gian Merlino

unread,
Nov 15, 2017, 6:00:48 PM11/15/17
to druid-de...@googlegroups.com
Integration for other things might be a bit difficult compared to other alternatives as it offers replicated log for replicating the "commands" on all participating nodes and your state machine has to implement the local storage(possibly using derby for persistent state)  while other solutions offer key-value stores.

I actually like that you bring your own state machine, so I would consider it a benefit. It means we can do one that is well suited to how Druid wants to use it. For some things that might not be a k/v store.

Your work on this is really interesting Himanshu, I wish you well :)

Gian

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.

Himanshu Gupta

unread,
Nov 15, 2017, 7:34:04 PM11/15/17
to Druid Development
Thanks. Yeah, with our own state machine we definitely have more control and do exactly what Druid needs.

-- Himanshu
Reply all
Reply to author
Forward
0 new messages