[ANN] Kafka ZooKeeper Exporter


Łukasz Mierzwa

Jul 20, 2017, 9:42:49 PM
to Prometheus Users
Hi,

we (Cloudflare) are running a few Kafka clusters which use ZooKeeper for cluster coordination. We export broker-level metrics using jmx_exporter, but we were missing metrics from the cluster-level point of view. We've noticed that under some scenarios (network partitions) a broker might have a different view of the world than the rest of the cluster; when that happens it might be out of sync with the rest of the cluster, but it won't indicate any replication lag (we're still looking into this). To get metrics from the authoritative cluster state, which lives in ZooKeeper, we've created https://github.com/cloudflare/kafka_zookeeper_exporter, which aims to give us better visibility and more detailed alerting.
One of the issues was that, with only broker-level metrics, under-replication alerts are triggered on the leader node, not on the replica that is out of sync, so the first step is always to find the affected node. Cluster-level metrics give us a clear list of nodes that are out of sync, so alerts are triggered for the affected node rather than for the leader.
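Purely as an illustration of the kind of alert this enables (the metric and label names below are placeholders, not necessarily what the exporter actually exposes), something along these lines becomes possible:

    # Placeholder names: fire for each (topic, partition, replica) that is
    # assigned but missing from the in-sync replica set.
    kafka_cluster_partition_replica_assigned
      unless on (topic, partition, replica)
    kafka_cluster_partition_replica_in_sync

The point is that the result carries a replica label, so the alert points at the out-of-sync broker directly.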
Hopefully this will also be useful to others, feedback very much welcomed.

Łukasz Mierzwa

Ben Kochie

Jul 21, 2017, 2:00:03 AM
to Łukasz Mierzwa, Prometheus Users
Thanks!

Any chance you can share the alerting rules in the repo as well?

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/3cbfac1a-8688-4c13-b44f-5da7e035f448%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brian Brazil

Jul 21, 2017, 3:18:57 AM
to Łukasz Mierzwa, Prometheus Users
Thanks for sharing, this'd be worth putting up on the official list.

This is a blackbox/snmp style exporter, so I'd suggest including an example of the Prometheus configuration for it as that's not exactly obvious to beginners.
I'd also remove the zookeeper and chroot labels, as that's something that should be handled on the Prometheus relabelling side. For example, if there's only one chroot the user won't want that label, and they'll already have the instance label to cover what the zookeeper label is doing.
kafka_zookeeper_up isn't always being set. Given that you're failing the whole scrape if you can't talk to ZooKeeper (which is fine), I'd suggest removing this metric to avoid confusion and users trying to alert on it being 0.
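As a rough sketch of the first two points (the target address is made up, and the zookeeper/chroot query parameters are just going by how the exporter is used elsewhere in this thread), the scrape config could look roughly like this, with the extra labels dropped on the Prometheus side:

    - job_name: kafka_zookeeper
      metrics_path: /kafka
      params:
        zookeeper: ['zk:2181']          # which ZooKeeper ensemble to query
        chroot: ['/kafka/cluster']      # which Kafka cluster inside it
      static_configs:
        - targets: ['kafka-zookeeper-exporter:9381']   # placeholder exporter address
      metric_relabel_configs:
        - action: labeldrop             # drop the redundant labels at ingestion time
          regex: zookeeper|chroot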

--

Łukasz Mierzwa

Jul 21, 2017, 12:03:58 PM
to Brian Brazil, Prometheus Users
We're still working on a good set of alert rules; I'll add those once we review and test what we currently have.
I've added the zookeeper & chroot labels to identify clusters, since they might have topics with the same name, but the first thing I did when configuring scraping was to remove those and replace them with a static label that identifies each cluster. It seems there's very little value in those labels in general, so I'll drop them soon.
I'll add some Prometheus config examples to the docs.
Good point about kafka_zookeeper_up: it's only set after a successful scrape, so one would need to alert using absent(kafka_zookeeper_up), which is a bit counterintuitive and doesn't really need a dedicated metric.
--
Łukasz Mierzwa

Brian Brazil

Jul 21, 2017, 12:06:34 PM
to Łukasz Mierzwa, Prometheus Users
On 21 July 2017 at 17:03, Łukasz Mierzwa <l.mi...@gmail.com> wrote:
We're still working on a good set of alert rules; I'll add those once we review and test what we currently have.
I've added the zookeeper & chroot labels to identify clusters, since they might have topics with the same name, but the first thing I did when configuring scraping was to remove those and replace them with a static label that identifies each cluster. It seems there's very little value in those labels in general, so I'll drop them soon.
I'll add some Prometheus config examples to the docs.

 
Good point about kafka_zookeeper_up: it's only set after a successful scrape, so one would need to alert using absent(kafka_zookeeper_up), which is a bit counterintuitive and doesn't really need a dedicated metric.

It makes sense to have a metric like that if there are useful metrics you can provide even when the target is down (e.g. haproxy_up, probe_success). Here, though, you're failing the scrape if the target has issues, so it's not providing additional information beyond what "up" already does.
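For illustration, the ordinary synthetic series already covers the failure case (the job name here just matches the example config in this thread):

    up{job="kafka_zookeeper"} == 0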

Brian
 


Łukasz Mierzwa

Jul 21, 2017, 7:43:13 PM
to Brian Brazil, Prometheus Users
Speaking of the example config: Kafka doesn't know much about hostnames (kafka1), only about broker IDs, which are numeric (101). To match with broker-level metrics that all have instance=kafka1, I wanted to rewrite the instance label with:

    - job_name: kafka_zookeeper
      dns_sd_configs:
        - names:
          - kafka-exporter-dns-srv
      metrics_path: /kafka
      params:
        zookeeper: ['zk:2181']
        chroot: ['/kafka/cluster']
      relabel_configs:
        - action: replace
          source_labels: [replica]
          regex: (.+)
          target_label: instance
          replacement: kafka${1}

But I can't get it to work.
Docs state: 
After relabeling, the instance label is set to the value of __address__ by default if it was not set during relabeling.

So I would expect that to work, but it seems that this relabeling is global to all metrics from a single scrape; the UI shows it on the scrape job label list with instance=kafka.
Am I assuming correctly that this won't provide a different relabeled value for each scraped metric?

--
Łukasz Mierzwa

Brian Brazil

Jul 22, 2017, 3:49:59 AM
to Łukasz Mierzwa, Prometheus Users
On 22 July 2017 at 00:42, Łukasz Mierzwa <l.mi...@gmail.com> wrote:
Speaking of the example config: Kafka doesn't know much about hostnames (kafka1), only about broker IDs, which are numeric (101). To match with broker-level metrics that all have instance=kafka1, I wanted to rewrite the instance label with:

    - job_name: kafka_zookeeper
      dns_sd_configs:
        - names:
          - kafka-exporter-dns-srv
      metrics_path: /kafka
      params:
        zookeeper: ['zk:2181']
        chroot: ['/kafka/cluster']
      relabel_configs:
        - action: replace
          source_labels: [replica]
          regex: (.+)
          target_label: instance
          replacement: kafka${1}

But I can't get it to work.
Docs state: 
After relabeling, the instance label is set to the value of __address__ by default if it was not set during relabeling.

So I would expect that to work, but it seems that this relabeling is global to all metrics from a single scrape; the UI shows it on the scrape job label list with instance=kafka.
Am I assuming correctly that this won't provide a different relabeled value for each scraped metric?

relabel_configs affect targets, and there's no replica name to work off here. That relabel action shouldn't even be setting the instance to kafka, as the replica target label will be empty with that configuration. Even using metric_relabel_configs here would be odd, as the change to the instance label would be wrong as it's not the zookeeper you talked to and would also lead to collisions.

What you want to do here, I think, is to use label_replace() in your queries to copy the instance label of your Kafka brokers over to the replica label, and then work from there.
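A rough sketch of that, assuming the cluster-level series carry topic, partition and replica labels (the metric names are placeholders):

    # Copy the numeric broker ID out of instance="kafka101" on the broker-level
    # series into a replica label, then join against the cluster-level series.
    label_replace(broker_level_metric, "replica", "$1", "instance", "kafka(.+)")
      * on (topic, partition, replica) group_left()
        cluster_level_metric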



--

Łukasz Mierzwa

Jul 22, 2017, 12:45:13 PM
to Brian Brazil, Prometheus Users
Thanks, label_replace() seems like a good tip.


--
Łukasz Mierzwa




ami...@gmail.com

Jan 23, 2018, 8:33:43 AM
to Prometheus Users
Hi all,

I am using the Kafka ZooKeeper exporter from the link above. Do I need to have the JMX exporter as well to be able to view data in Grafana?

Łukasz Mierzwa

Jan 23, 2018, 12:25:33 PM
to ami...@gmail.com, Prometheus Users




--
Łukasz Mierzwa

amit mishra

Jan 24, 2018, 11:42:50 AM
to Łukasz Mierzwa, Prometheus Users
Hi all,

I am getting the error "text format parsing error in line 2: invalid metric name" in Prometheus, but my curl request "curl localhost:9381/kafka?zookeeper=infra-zookeeper:2181&chroot=/kafka/cluster" returns results.

Please help