Is it okay to post a Prometheus user survey here?


Tom Lee

unread,
May 13, 2020, 1:21:29 PM5/13/20
to Prometheus Users
Hi folks,

Full disclosure: I'm an engineer from New Relic (https://newrelic.com/). We've been looking into improving our open source monitoring story and Prometheus is a key piece of that. Right now, though, there are some pieces of the puzzle that we can't easily dig into without more input from the Prometheus community at large.

Is this mailing list an okay place to send a Google Forms-style survey with maybe half a dozen questions? And if not, can folks suggest somewhere that might be more appropriate?

Cheers,
Tom

Julius Volz

unread,
May 13, 2020, 3:37:36 PM5/13/20
to Tom Lee, Prometheus Users
Hi Tom,

Thanks for checking in first! We're currently discussing within the Prometheus Team how we would prefer to handle such requests in general (so that things remain fair between companies, etc.) and will get back to you as soon as possible.

Regards,
Julius

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/b48e651a-b264-4ade-8eb8-7dc7eec11b15%40googlegroups.com.


--
Julius Volz
PromLabs - promlabs.com

Tom Lee

unread,
May 13, 2020, 3:44:10 PM5/13/20
to Julius Volz, Prometheus Users
Understood Julius, appreciate the transparency. Thank you!

Richard Hartmann

unread,
May 15, 2020, 12:25:54 PM5/15/20
to Tom Lee, Julius Volz, Prometheus Users
Hi Tom,

After some internal deliberation, we think it would be unfair to give
any single survey our official blessing, and running surveys more
than, say, once a year seems like too much to ask of users. On the
other hand, user surveys make sense for everyone.

Would you be OK with sending your questions to
prometh...@googlegroups.com or as a reply in this thread? We
would then publish them for comments/feedback and run the survey under
the Prometheus umbrella, sharing replies publicly.


Best,
Richard




--
Richard

Tom Lee

unread,
May 15, 2020, 1:27:37 PM5/15/20
to Richard Hartmann, Julius Volz, Prometheus Users
Hi Richard,

Reading between the lines it sounds like we're potentially talking about a broader/larger "State of Clojure" type thing for Prometheus. Is that accurate?

Certainly don't mind the results being public. My only real concern is timelines: we were hoping to use some of the raw data to inform load testing on our end, and our schedule is already looking pretty aggressive. If it's going to take weeks or more before results start rolling in, we probably won't get the data we were hoping for in time. From a purely selfish perspective we'd be pretty disappointed to go forward without data from "the source", so to speak. Of course, I totally understand the team's position here. I'm just whining to myself.

Timelines aside, we'd be excited to see something "official" in the longer term. It would be useful for engineers like myself, and I know there are product managers and research folks lurking our virtual halls who would love such readily available data for future efforts.

The questions from our survey:
  • Roughly how many Prometheus *servers* are you operationally responsible for?
  • Of all the Prometheus servers that you are responsible for, which version would you say is the most widely deployed?
  • How many unique metrics are reporting across all of your Prometheus servers?
  • How many unique *timeseries* are reporting across all of your Prometheus servers?
  • If you use Grafana to visualize your Prometheus data, what version of Grafana do you typically use?
  • What value do you typically use for the "scrape_interval" config setting in your Prometheus servers?
  • Is there anything else you would like to tell us about your Prometheus deployment(s)? For example, interesting challenges, pain points, or quirks of your configuration?
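For context on the scrape_interval question, that setting lives in prometheus.yml, either globally or per scrape job. A minimal sketch (the values and target are examples only; Prometheus's default scrape_interval is 1m if unset):

```yaml
global:
  scrape_interval: 15s      # applies to every job unless overridden

scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 30s    # per-job override
    static_configs:
      - targets: ["localhost:9090"]
```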
Looking forward to whatever might eventuate here; those big community surveys are always a lot of fun to read through.

Cheers,
Tom

Ben Kochie

unread,
May 16, 2020, 4:25:20 AM5/16/20
to Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
Thanks for the link to the other survey. That's pretty good.

The questions from our survey:
  • Roughly how many Prometheus *servers* are you operationally responsible for?
  • Of all the Prometheus servers that you are responsible for, which version would you say is the most widely deployed?
 The first two questions are good. I might modify the first one to clarify with/without HA. For example, we have 21 Prometheus servers, but 7 of those are duplicates for HA.
  • How many unique metrics are reporting across all of your Prometheus servers?
  • How many unique *timeseries* are reporting across all of your Prometheus servers?
These two need to be clarified for Prometheus. We tend to use the terms metrics and time-series interchangeably. Are you asking about unique metric names?
  • If you use Grafana to visualize your Prometheus data, what version of Grafana do you typically use?
  • What value do you typically use for the "scrape_interval" config setting in your Prometheus servers?
  • Is there anything else you would like to tell us about your Prometheus deployment(s)? For example, interesting challenges, pain points, or quirks of your configuration?
I have a few additional questions that could be added to the list.

* How many unique exporter/target types do you have?
* What is your samples/second ingestion rate across all Prometheus servers?
* What is your general metric retention time?
* Do you use external storage (federation/remote_write/etc.)?
* If yes, which external storage system(s)?
 

Brian Brazil

unread,
May 16, 2020, 5:01:43 AM5/16/20
to Ben Kochie, Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
I'd also be interested in what versions of Java/Python are in use, and in particular the oldest JVM versions that users are running the jmx_exporter with.

Brian
 

Julius Volz

unread,
May 16, 2020, 5:10:59 AM5/16/20
to Brian Brazil, Ben Kochie, Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
If we're getting that specific I'd be worried there'd be a lot of other questions in that specificity category. But if it's only these couple it would be fine.
 

Julius Volz

unread,
May 16, 2020, 5:13:08 AM5/16/20
to Ben Kochie, Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
  • How many unique metrics are reporting across all of your Prometheus servers?
  • How many unique *timeseries* are reporting across all of your Prometheus servers?
These two need to be clarified for Prometheus. We tend to use the terms metrics and time-series interchangeably. Are you asking about unique metric names?

I guess the first question is about unique metric names. The problem is that there's no easy way to get the number of unique metric names across multiple servers, as there might be anywhere between 0 and 100% overlap of metric names between Prometheus servers, and getting users to calculate a set union might be too much work. Also, time series are more relevant than the number of metrics in Prometheus, so maybe we should only keep the second question?

Ben Kochie

unread,
May 16, 2020, 5:18:50 AM5/16/20
to Julius Volz, Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
I guess the first question is about unique metric names. The problem is that there's no easy way to get the number of unique metric names across multiple servers, as there might be anything between 0 - 100% overlap of metric names between Prometheus servers, and getting users to calculate a set union might be too much work. Also, time series are more relevant than number of metrics in Prometheus, so maybe we should only keep the second question?

Yes, I'm interested in what Tom's intent is behind the question. From a Prometheus perspective, the total time-series load is most important. But it might be different for his use case.

We should probably include some specific PromQL queries to make the results easy to gather for survey participants. 
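A few candidate per-server queries along these lines (a sketch only, using Prometheus's standard self-monitoring metrics; combining the numbers across servers would still be a manual step):

```promql
# Active series in the local TSDB head (per server):
prometheus_tsdb_head_series

# Unique metric names on this server:
count(count by(__name__) ({__name__!=""}))

# Current ingestion rate in samples/second (per server):
rate(prometheus_tsdb_head_samples_appended_total[5m])
```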

Brian Brazil

unread,
May 16, 2020, 5:26:57 AM5/16/20
to Julius Volz, Ben Kochie, Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
I don't see how this is much different in terms of specificity from some of the other proposed questions.

These are things I want to know as the Java/Python maintainer, as we currently support some rather old versions of these runtimes. It'd be good to know if it's e.g. safe to drop Python 2.6 support.

Brian

Julius Volz

unread,
May 16, 2020, 5:29:05 AM5/16/20
to Brian Brazil, Ben Kochie, Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
Fine with me if you want to add these to the doc. I do think they're more specific in that they're less about the high-level deployment stats, and relevant for a smaller number of users.

Julius Volz

unread,
May 16, 2020, 5:40:30 AM5/16/20
to Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
Hi Tom,

I shared the current state of people's questions in this doc, open for comments by the public:


I hope it won't grow much larger than its current length, but I think the current questions would already give us a pretty good overview of the Prometheus deployments out there.

Cheers,
Julius

Stuart Clark

unread,
May 16, 2020, 6:04:33 AM5/16/20
to Ben Kochie, Julius Volz, Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
We should probably include some specific PromQL queries to make the results easy to gather for survey participants.


I think it would be really useful to give details of how to find the answers to all the questions: simple command-line invocations, a PromQL query, python -V, and so on.

It should be as easy as possible to answer, in my opinion.

Would we be interested in the usage of the wider ecosystem? E.g usage of different SD methods, Alertmanager integrations, remote read/write systems, Thanos/Cortex/VictoriaMetrics?


Julius Volz

unread,
May 16, 2020, 12:12:10 PM5/16/20
to Stuart Clark, Ben Kochie, Tom Lee, Richard Hartmann, Julius Volz, Prometheus Users
I think it would be really useful to give details of how to find the answers to all the questions - simple command line commands, etc - PromQL query, python -V, etc.


I added some instructions where it seemed relevant to the draft doc. 

It wants to be as easy as possible to answer in my opinion.

Would we be interested in the usage of the wider ecosystem? E.g usage of different SD methods, Alertmanager integrations, remote read/write systems, Thanos/Cortex/VictoriaMetrics?

We have long-term storage in there already, but indeed SD and AM integrations would be great as well! Will add. 

Tom Lee

unread,
May 17, 2020, 1:57:14 PM5/17/20
to Ben Kochie, Julius Volz, Richard Hartmann, Julius Volz, Prometheus Users
Hey Ben,

Wow, look at all this activity! Thanks so much for jumping on this stuff.

Just answering a question that was directed at me below:

Yes, I'm interested in what Tom's intent is behind the question. From a Prometheus perspective, the total time-series load is most important. But it might be different for his use case.

Ah yep, really great question. I'm going to absolutely butcher the terminology here, but the idea is that we're trying to differentiate between "number of unique metric names" and "label/dimensional cardinality within those metrics". The reason for differentiating is something of an implementation detail of our own systems, but I think it applies somewhat to Prometheus and/or Grafana too: when you run a non-aggregating query for a metric x, you might expect to see one timeseries charted -- or you might see hundreds or even thousands. In our own test setup we have JMX metrics for 15 Kafka servers reporting in. Executing a "query" like kafka_cluster_Partition_Value (a metric reported by the JMX exporter on behalf of Kafka) yields something like 20,000-30,000 distinct timeseries charted by Prometheus, and it takes a surprising amount of time to execute that simple little query as a result. This sort of cardinality "explosion" has big implications for system architecture and scalability in our own systems, too.

Please let me know if that's still not clear!
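To make the distinction concrete, here is a toy sketch (not Prometheus code; the metric names, label sets, and counts are made up) of how a small number of metric names can fan out into many series, and how a topk-style count per name exposes the skew:

```python
from collections import Counter

# Each series is a metric name plus a unique label set.
# One "hot" metric name fans out into 1000 series; "up" stays small.
series = [
    ("kafka_cluster_Partition_Value", {"topic": f"t{i}", "partition": str(p)})
    for i in range(100) for p in range(10)
] + [
    ("up", {"instance": f"host{i}:9090"}) for i in range(15)
]

unique_metric_names = len({name for name, _ in series})
total_series = len(series)

# Equivalent in spirit to: topk(1, count by(__name__) ({__name__!=""}))
per_name = Counter(name for name, _ in series)
top_name, top_count = per_name.most_common(1)[0]

print(unique_metric_names)  # 2
print(total_series)         # 1015
print(top_name, top_count)  # kafka_cluster_Partition_Value 1000
```

The two summary numbers alone (2 names, 1015 series) don't reveal that nearly all the cardinality lives under one name; the per-name count does.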
 
We should probably include some specific PromQL queries to make the results easy to gather for survey participants. 

Yeah, this is a great idea. My own PromQL skills are pretty lame, or I probably would've done something like this myself. :)

Julius Volz

unread,
May 20, 2020, 9:38:42 AM5/20/20
to Tom Lee, Ben Kochie, Richard Hartmann, Julius Volz, Prometheus Users

Sorry for the delay! Yeah, makes sense: metric names that have many series can be problematic in UIs when doing queries without filters or aggregations. On the other hand, we know that having at least *some* of those is very common (almost every user has a couple of huge ones), so we probably don't need a survey to tell us that :) More importantly, to see how many metrics are too "overloaded", just having the total number of metric names vs. the total number of series doesn't answer the question fully: you don't know whether the series are evenly split up across your metric names, or whether they're all clustered in a few names. It's also a bit challenging to get users to compile a list of distinct metric names across Prometheus servers without some command-line foo or similar. We could ask something along the lines of "How many series do your largest N metric names contain?", and then give them a query like 'topk(3, count by(__name__) ({__name__!=""}))' to determine that per server. It would still require some manual work to combine results between servers though, hmmm...

Tom Lee

unread,
May 20, 2020, 1:00:38 PM5/20/20
to Julius Volz, Ben Kochie, Richard Hartmann, Julius Volz, Prometheus Users
Yeah, agree. I really like the "largest N metric names" idea. I think both total series and "top N metrics" are interesting for different reasons, but also agree getting "real" numbers is a challenge whatever we decide to do here. :)

Julius Volz

unread,
May 22, 2020, 4:49:43 AM5/22/20
to Tom Lee, Ben Kochie, Richard Hartmann, Julius Volz, Prometheus Users
Yeah, I think as interesting as this could be, the survey is growing quite large already, and this would be one of the more complicated questions in terms of explaining it clearly enough and then getting users to compile the results. So I'm tending towards leaving it out this time around.

But from experience you can safely assume that most large Prometheus deployments have a few metric names that are huge in their number of series (like a couple of 10k), and that would blow up any graph or other UI display without aggregation / filtering.

Julius Volz

unread,
May 22, 2020, 9:47:27 AM5/22/20
to Tom Lee, Ben Kochie, Richard Hartmann, Julius Volz, Prometheus Users
Hi Tom,

Just posted the final survey here:


Let's see what the results look like. Hope it's helpful, although not all questions made it in this time :)

Regards,
Julius

Tom Lee

unread,
May 22, 2020, 10:43:35 AM5/22/20
to Julius Volz, Ben Kochie, Richard Hartmann, Julius Volz, Prometheus Users
Awesome, appreciate it. Thank you all so much for your help getting this out!