kafka streams does not support one broker down

yoges...@gmail.com

Jul 26, 2018, 6:33:47 AM
to Confluent Platform
I have a 3-node Kafka cluster with 3 brokers.

I am running 3 Kafka Streams instances with the same application.id.

Each node hosts one broker and one Kafka Streams application.

Everything works fine during setup.

When I bring down one node, one Kafka broker and one Streams app go down with it.

I then see exceptions in the other two Streams apps, and they never rebalance. I waited for hours and they never came back to normal.

Is there anything I am missing?

I also tried detecting when a broker is down and then calling stream.close, cleanup, and restarting. This doesn't help either.

Can anyone help me?

Matthias J. Sax

Aug 1, 2018, 12:40:50 PM
to confluent...@googlegroups.com
There is a JIRA with a discussion for this issue:
https://issues.apache.org/jira/browse/KAFKA-7209
> --
> You received this message because you are subscribed to the Google
> Groups "Confluent Platform" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to confluent-platf...@googlegroups.com
> <mailto:confluent-platf...@googlegroups.com>.
> To post to this group, send email to confluent...@googlegroups.com
> <mailto:confluent...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/confluent-platform/7b5056e5-c3b2-437a-94b8-5a09a83dfe17%40googlegroups.com
> <https://groups.google.com/d/msgid/confluent-platform/7b5056e5-c3b2-437a-94b8-5a09a83dfe17%40googlegroups.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout.

Victoria

Aug 5, 2018, 1:01:39 PM
to Confluent Platform
Hi Matthias,

We experienced a similar situation with Kafka Streams last week.
We are using Kafka Streams 1.0.0.
We have 5 Kafka brokers and 6 Kafka Streams instances (3 threads per instance).
One of our Kafka brokers crashed but didn't die completely, leaving it in a kind of "zombie" state.
Meanwhile, our Kafka Streams instances went through a reset.
After a few hours the problem with the broker was identified and the problematic broker was restarted.
At first it seemed that Kafka Streams had started to recover: all threads were in the ASSIGNED state.
However, hours passed and none of the threads reached RUNNING.
We added a StateRestoreListener to monitor the situation.
What we saw is that at the beginning the onBatchRestored and onRestoreEnd methods were called quite often.
However, after about an hour the number of calls to those methods dropped significantly.
From there on the number of calls kept decreasing until, after ~12 hours, there were no calls at all.
However, many of the threads were still in the ASSIGNED state and had not reached RUNNING.
From our analysis, onRestoreEnd was called for about 40% of the partitions.
We waited for over 24 hours but nothing changed.

We would appreciate any advice on what might have gone wrong and how it can be prevented in the future.



Matthias J. Sax

Aug 7, 2018, 12:31:40 PM
to confluent...@googlegroups.com
I am not entirely sure. One suspicion is that as soon as a task is
restored, processing for that task continues. Thus, processing and
restoring interleave across different tasks. Because processing also
requires CPU and network capacity, it can slow down the restore process.
As more and more tasks get restored, the restore might slow down more
and more.

Can you verify whether restoring is still happening (even with reduced
throughput)? There should still be calls to the restore listener and
also some log statements about fetch requests (I think you would need to
enable DEBUG logging).

-Matthias
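
For anyone trying the check suggested above, here is a rough sketch of how the restore listener callbacks could be turned into a "is restore still moving?" signal. RestoreProgressTracker and its method names are hypothetical and not part of Kafka Streams; it is plain Java so it can be tried without a cluster. The assumption is that a real StateRestoreListener (registered via KafkaStreams#setGlobalStateRestoreListener) would forward its callbacks here, using something like "topic-partition" as the key and System.currentTimeMillis() as the timestamp.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Kafka Streams class): records restore progress
// per changelog partition so that slow restores can be told apart from stuck ones.
final class RestoreProgressTracker {

    private static final class Progress {
        final long endingOffset;   // offset the restore has to reach
        long restoredUpTo;         // highest batch end offset seen so far
        long lastUpdateMs;         // time of the most recent callback
        boolean done;              // set once onRestoreEnd was seen

        Progress(long startingOffset, long endingOffset, long nowMs) {
            this.endingOffset = endingOffset;
            this.restoredUpTo = startingOffset;
            this.lastUpdateMs = nowMs;
        }
    }

    private final Map<String, Progress> byPartition = new HashMap<>();

    // Forward StateRestoreListener#onRestoreStart here.
    synchronized void onRestoreStart(String partition, long startingOffset,
                                     long endingOffset, long nowMs) {
        byPartition.put(partition, new Progress(startingOffset, endingOffset, nowMs));
    }

    // Forward StateRestoreListener#onBatchRestored here.
    synchronized void onBatchRestored(String partition, long batchEndOffset, long nowMs) {
        Progress p = byPartition.get(partition);
        if (p == null) {
            return; // batch for a partition we never saw start; ignore it
        }
        p.restoredUpTo = Math.max(p.restoredUpTo, batchEndOffset);
        p.lastUpdateMs = nowMs;
    }

    // Forward StateRestoreListener#onRestoreEnd here.
    synchronized void onRestoreEnd(String partition, long nowMs) {
        Progress p = byPartition.get(partition);
        if (p == null) {
            return;
        }
        p.done = true;
        p.lastUpdateMs = nowMs;
    }

    // Offsets still to restore for one partition (0 if unknown or finished).
    synchronized long remaining(String partition) {
        Progress p = byPartition.get(partition);
        if (p == null || p.done) {
            return 0L;
        }
        return Math.max(0L, p.endingOffset - p.restoredUpTo);
    }

    // Share of started partitions for which onRestoreEnd has been seen.
    synchronized double completedFraction() {
        if (byPartition.isEmpty()) {
            return 0.0;
        }
        long finished = 0;
        for (Progress p : byPartition.values()) {
            if (p.done) {
                finished++;
            }
        }
        return (double) finished / byPartition.size();
    }

    // Unfinished partitions with no callback for longer than maxIdleMs, sorted by name.
    synchronized List<String> stalled(long nowMs, long maxIdleMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Progress> e : byPartition.entrySet()) {
            Progress p = e.getValue();
            if (!p.done && nowMs - p.lastUpdateMs > maxIdleMs) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

Logging completedFraction() and stalled(...) periodically would show whether the restore is merely slow (offsets keep advancing) or has stopped entirely. A fraction that plateaus, as with the roughly 40% reported earlier in this thread, together with a non-empty stalled list would point to the latter.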