zookeeper crashing effects on druid


Avinoam Ben Nachum

Dec 7, 2020, 9:04:18 AM
to Druid User
Hello dear Druid team,
This might be an obvious question to some, but I would like to boost my confidence with an official answer from an expert.

We have a Druid cluster running with a standalone ZooKeeper server on a dedicated VM.
The cluster is very stable and is not being changed or restarted unless needed.

My question is:
What happens if the ZooKeeper service/server crashes? How long, if at all, will this affect the cluster, and in what way?

I am asking because I am thinking of creating a ZooKeeper ensemble but am not sure we actually need one, as we can restore ZooKeeper from a snapshot if needed.

Thank you for your time, much appreciated.
Avinoam

Peter Marshall

Dec 9, 2020, 8:07:01 AM
to Druid User
The docs page on Zookeeper is a good starting place:

I'd say it's always best to have Druid connected to an ensemble for resilience, so I would definitely go ahead with that :D  Druid will "survive" for a while, but you can see from the docs that ZooKeeper is important to interprocess communication in a number of ways.

Avinoam Ben Nachum

Dec 9, 2020, 8:43:54 AM
to Druid User

Hi Peter, Thank you for your reply!
I have read the docs and also googled this, but never found a solid, simple answer to my question. And I am afraid to say that, though I was very happy to receive it, your reply didn't answer my question either.

From what I understand, if I shut down my ZooKeeper now, the cluster will not be affected at all unless/until a Druid node is restarted, in which case it will not find ZooKeeper and will thus be out of the cluster.
Is my understanding correct?

Best regards,
Avinoam

Peter Marshall

Dec 9, 2020, 9:50:54 AM
to druid...@googlegroups.com
Aha! I thought you might want details! OK, so this is a "high level" diagram of the communication lines:

"Coordinator leader election" and "Overlord leader election" both go through ZooKeeper. If ZK is down, Druid will be unable to elect new leaders to control ingestion and to publish new data.
"Segment load/drop protocol between Coordinator and Historical" is those purple lines below: while the Coordinator will know that new data is available, it will not be able to tell Historicals to load the data, and it will not be able to rebalance the cluster either.
"Segment 'publishing' protocol from Historical" refers to the "OK" message that ingestion jobs get once data has been ingested, confirming that it is safe. Without that message, ingestion will start to degrade.
"Overlord and MiddleManager task management" is, as you might expect, everything to do with the Overlord conducting its ingestion effectively. It will be unable to issue new tasks, for example.

Meanwhile, the Broker will still have the timeline of data in memory, so it will know where data is as of the "last good state". Querying will therefore continue to work, notwithstanding the ingestion issues above.
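
Since queries keep working while ingestion quietly degrades, a ZooKeeper outage may not be obvious right away. Here is a minimal sketch of a probe, assuming the `ruok` four-letter command is enabled on the server (on ZooKeeper 3.5+ it has to be allowed via `4lw.commands.whitelist`). The function name `zk_is_ok` and the host in the usage note are made up for illustration:

```python
import socket


def zk_is_ok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """Return True if the server answers 'imok' to ZooKeeper's 'ruok' command.

    Hypothetical helper: any connection error, timeout, or unexpected
    reply is treated as "not OK".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            # The server closes the connection after replying, so read until EOF.
            while True:
                data = sock.recv(64)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks) == b"imok"
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Calling `zk_is_ok("zk1.example.com")` (a placeholder host) from a cron job or a monitoring check would flag a dead ZooKeeper before the ingestion backlog does. Note that `imok` only says the server process is up, not that it is part of a healthy quorum.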

I hope this helps?


[Attached image: image.png, the high-level diagram of Druid communication lines referred to above]


--
Peter Marshall
Apache Druid® Community Technology Evangelist


Peter Marshall

Dec 9, 2020, 9:53:32 AM
to druid...@googlegroups.com
(I'm also going to ask someone I know who knows more than me (!!) to check what I've written....!)

Avinoam Ben Nachum

Dec 10, 2020, 4:28:48 AM
to Druid User
Peter, again, thank you very much for taking the time and replying with such detail!

So although our system is robust, you kind of convinced me that there should be some sort of solution here for resilience.
I believe a 3-node ZooKeeper ensemble should do the trick then. Back to reading the docs to understand the best practice for this in a production environment.
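
For reference, the reasoning behind picking 3 can be sketched as a quick calculation: a ZooKeeper ensemble stays available only while a strict majority of its servers is up. The helper names and hostnames below are made up for illustration; `druid.zk.service.host` is the Druid property that takes the comma-separated ZooKeeper connect string.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with the given number of servers."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def druid_zk_property(hosts, port: int = 2181) -> str:
    """Build the druid.zk.service.host line for a list of ZooKeeper hosts."""
    connect = ",".join(f"{h}:{port}" for h in hosts)
    return f"druid.zk.service.host={connect}"


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, "servers: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "failure(s)")
    # Placeholder hostnames, not real machines.
    print(druid_zk_property(["zk1.example.com", "zk2.example.com", "zk3.example.com"]))
```

A single server tolerates no failures, 3 servers tolerate 1, and 4 servers still tolerate only 1, which is why odd ensemble sizes are the usual choice.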

Thanks again, Peter. Enjoy your holidays.
Avinoam

Peter Marshall

Dec 11, 2020, 7:08:31 AM
to druid...@googlegroups.com