Add labels


Christian Oelsner

Nov 18, 2022, 4:51:41 AM
to Prometheus Users
Hello,

I am trying to add labels to metrics fetched from Confluent Cloud.
We are monitoring some 35 Kafka clusters.

scrape_configs:
  - job_name: Confluent Cloud
    scrape_interval: 1m
    scrape_timeout: 1m
    honor_timestamps: true
    static_configs:
      - targets:
          - api.telemetry.confluent.cloud
    scheme: https
    basic_auth:
      username: <Cloud API Key>
      password: <Cloud API Secret>
    metrics_path: /v2/metrics/cloud/export
    params:
      "resource.kafka.id":
        - lkc-1
        - lkc-2
        - lkc-3
        - lkc-4
        - lkc-5
        - lkc-6
        # etc


Each lkc-xxxx represents a cluster which belongs to a department.
I would like to add a departmentID to the metrics belonging to each cluster.
For example, lkc-1 and lkc-5 would belong to department "analytics".

How would I go about adding labels to the metrics?

Best regards
Christian Oelsner

Brian Candler

Nov 18, 2022, 9:05:10 AM
to Prometheus Users
> How would i go about adding labels to the metrics?

You have this:

   static_configs: 
      - targets:
        - api.telemetry.confluent.cloud 

This means you are only scraping one endpoint, one time.  If you wanted to add the same labels to every metric received from that endpoint, you would do this:

   static_configs: 
      - labels:
          foo: bar
          baz: qux
        targets:
        - api.telemetry.confluent.cloud 

Of course, that's not what you're asking.

The question now is, do the metrics that you get back all carry a label which identifies the cluster, such as {cluster="lkc-1"}?

If so, then it's a simple case of metric relabelling to add the department labels corresponding to each cluster ID.  Add to the scrape job:

    metric_relabel_configs:
      - source_labels: [cluster]
        regex: lkc-1
        target_label: departmentID
        replacement: Accounts
      - source_labels: [cluster]
        regex: lkc-2
        target_label: departmentID
        replacement: Engineering
      # etc

If you don't have such a label, then you will need to scrape the API endpoint separately, once for each value of resource.kafka.id

The dumb option is multiple scrape jobs:

scrape_configs:
  - job_name: Confluent Cloud lkc-1
    scrape_interval: 1m
    scrape_timeout: 1m
    static_configs:
      - labels:
          department: Accounts
        targets:
          - api.telemetry.confluent.cloud
    scheme: https
    basic_auth:
      username: <Cloud API Key>
      password: <Cloud API Secret>
    metrics_path: /v2/metrics/cloud/export
    params:
        "resource.kafka.id": [lkc-1]
  - job_name: Confluent Cloud lkc-2
    scrape_interval: 1m
    scrape_timeout: 1m
    static_configs:
      - labels:
          department: Engineering
        targets:
          - api.telemetry.confluent.cloud
    scheme: https
    basic_auth:
      username: <Cloud API Key>
      password: <Cloud API Secret>
    metrics_path: /v2/metrics/cloud/export
    params:
        "resource.kafka.id": [lkc-2]
  # ... etc

That should work just fine, but is annoyingly verbose and repetitive.

The second option, which I would normally use in this situation, is to set the query parameter using a __param_XXXX label:

scrape_configs:
  - job_name: Confluent Cloud
    scrape_interval: 1m
    scrape_timeout: 1m
    static_configs:
      - labels:
          department: Accounts
          "__param_resource.kafka.id": lkc-1
        targets:
          - api.telemetry.confluent.cloud
      - labels:
          department: Engineering
          "__param_resource.kafka.id": lkc-2
        targets:
          - api.telemetry.confluent.cloud
      - labels:
          department: Special Projects
          "__param_resource.kafka.id": lkc-3
        targets:
          - api.telemetry.confluent.cloud
      # etc

    scheme: https
    basic_auth:
      username: <Cloud API Key>
      password: <Cloud API Secret>
    metrics_path: /v2/metrics/cloud/export

Here, the parameter value is set to a single value each time using the magic label "__param_<paramname>" instead of using "params: { name: [ list_of_values ] }"

Unfortunately, the problem is that I'm not sure that __param supports parameter names with dots in them, because dots are technically not valid in a label name.  You would need to try it to find out if it works, and I wouldn't be surprised if it were rejected.

Aside:
- You should almost never use "honor_timestamps" so I have removed it in the examples above.  If you do use it, you have to be very sure why, and understand how it may break things.
- When there are multiple targets like this I would use file_sd_configs rather than static_configs (it's easier to maintain).
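To sketch what the file_sd_configs approach might look like (the file path is an assumption, and the same caveat about dots in the __param label name applies): the scrape job points at a file, and each entry in that file carries its own labels and query parameter:

```yaml
# In the scrape job, in place of static_configs:
#   file_sd_configs:
#     - files: ['/etc/prometheus/confluent-targets.yml']
#
# Contents of /etc/prometheus/confluent-targets.yml:
- labels:
    department: Accounts
    "__param_resource.kafka.id": lkc-1
  targets:
    - api.telemetry.confluent.cloud
- labels:
    department: Engineering
    "__param_resource.kafka.id": lkc-2
  targets:
    - api.telemetry.confluent.cloud
```

Prometheus watches the file and picks up changes without a restart, so adding a new cluster is just a new entry in the file.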

The downside to these approaches is that you are now hitting the same API endpoint N times (each returning 1/Nth of the data).  This only matters if you get charged per API call.

If you still want to fetch the responses in a single API call as you are now, then you will have to use metric relabelling, and somehow decide for each metric that comes back which kafka cluster it came from by examining the labels - which is the first approach I proposed.

HTH,

Brian.

Christian Oelsner

Nov 19, 2022, 6:43:26 AM
to Prometheus Users
Hi Brian,
Thanks for your input, I will try working with these suggestions.

I put in honor_timestamps only because it was in the example config provided in the Confluent Cloud Metrics API documentation.
The reason I am fetching the metrics all in one call is that Confluent imposes a limit of 60 requests per hour, and we found that we often hit that limit and received an HTTP 429 Too Many Requests. After that we were "locked out" for 15-20 minutes. This was not optimal.

A quick query in prometheus for example gives me this:
confluent_kafka_server_retained_bytes{instance="api.telemetry.confluent.cloud:443", job="Confluent-Cloud", kafka_id="lkc-0x3v22", topic="confluent-kafka-connect-qa.confluent-kafka_configs"}

Does that mean that I have a label simply called kafka_id?

I did in fact try to wrap my head around using file_sd_configs but could not work out how to handle the params part of it, so I gave up on that. It would be nice though, since our list of clusters keeps growing every week.

Let me try some of your thoughts over the weekend and report back here.

Thanks again.

/Christian Oelsner

Brian Candler

Nov 20, 2022, 5:36:57 AM
to Prometheus Users
On Saturday, 19 November 2022 at 11:43:26 UTC christia...@gmail.com wrote:
A quick query in prometheus for example gives me this:
confluent_kafka_server_retained_bytes{instance="api.telemetry.confluent.cloud:443", job="Confluent-Cloud", kafka_id="lkc-0x3v22", topic="confluent-kafka-connect-qa.confluent-kafka_configs"}

Does that mean that i have a label simply called kafka_id?

Yes indeed.  So if you can relate the values of that to the department, then you can use the simple metric relabelling I showed originally to add the departmentID label. But you need a separate rewrite rule for each kafka_id to department mapping - so you'll have to update the config every time you add a new cluster (which you're already doing to add the new query params).
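Concretely, with the kafka_id label from your example, the relabelling from my first post would look like this (the department names are of course placeholders for your own mapping):

```yaml
    metric_relabel_configs:
      - source_labels: [kafka_id]
        regex: lkc-0x3v22
        target_label: departmentID
        replacement: Engineering
      - source_labels: [kafka_id]
        regex: lkc-0x3v25
        target_label: departmentID
        replacement: Accounts
      # one rule per kafka_id -> department mapping
```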

There is another approach to consider: you can make a separate set of static timeseries with the metadata bindings, like this:

kafka_cluster_info{kafka_id="lkc-0x3v22", departmentID="Engineering", env="production"} 1
kafka_cluster_info{kafka_id="lkc-0x3v25", departmentID="Accounts", env="test"} 1
...

(A static timeseries can be made using node_exporter textfile_collector, or a static web page that you scrape)
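For example, with node_exporter's textfile collector, the info series could come from a small file dropped into the collector directory (the filename and directory are whatever you configure via --collector.textfile.directory):

```
# kafka_clusters.prom, in the node_exporter textfile directory
# HELP kafka_cluster_info Static metadata for Confluent Kafka clusters
# TYPE kafka_cluster_info gauge
kafka_cluster_info{kafka_id="lkc-0x3v22",departmentID="Engineering",env="production"} 1
kafka_cluster_info{kafka_id="lkc-0x3v25",departmentID="Accounts",env="test"} 1
```

Updating the department mapping then becomes a matter of editing this one file, with no Prometheus config change.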

The "kafka_id" label here has to match the "kafka_id" label values in the scraped data.  Then whenever you do a query on one of the main metrics, you can do a join to add the extra metadata labels, something like this:
 
confluent_kafka_server_retained_bytes * on (kafka_id) group_left(departmentID,env) kafka_cluster_info

Or you can do filtering on the metadata to select only the clusters belonging to a particular department or for a particular environment, e.g.

confluent_kafka_server_retained_bytes * on (kafka_id) group_left(departmentID) kafka_cluster_info{env="production"}

For the full details of this approach see:


The tradeoff here is that your queries get more complex whenever you need the departmentID or environment labels, especially in alerting rules.  Adding the extra labels at scrape time keeps your queries simpler.

You can also combine both approaches: use recording rules with join queries like those above, to create new metrics with the extra labels.
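A sketch of such a recording rule (the rule and group names here are arbitrary choices):

```yaml
groups:
  - name: kafka_metadata_joins
    rules:
      - record: confluent_kafka_server_retained_bytes:by_department
        expr: >
          confluent_kafka_server_retained_bytes
            * on (kafka_id) group_left(departmentID, env)
          kafka_cluster_info
```

Dashboards and alerts can then query the recorded metric directly, with the extra labels already attached.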

 
I did infact try to wrap my head around using file_sd_configs but could not work out how the params part of it, so i gave up on that. It would be nice though, since our list of clusters keps growing every week.

If you're only scraping the API once (because you have an API limit to avoid) then a single target with static_configs is fine.

Regards,

Brian.

Christian Oelsner

Nov 22, 2022, 7:24:05 AM
to Prometheus Users
Hi Brian,
Once again, thanks a lot for your assistance.
I went with the metric_relabel_configs you showed in your first post.
It worked nicely.

Cheers :)

Regards
Christian Oelsner