We're plugging Prometheus into DNS monitoring on a fairly large scale. One of our goals is to understand how many queries of various types and methods we're seeing from network ranges, as well as the autonomous system neighbors and origins associated with those networks. We are building a collector that will parse our DNS queries into time-series data and add labels that give us enough dimensions to perform the queries we want. We are taking BGP announcements, creating labels based on those announcements, and associating them with counters for various metrics, along with the additional ASN origin/transit labels.

Here is an example metric, which stores how many "A" record requests we receive, along with the labels associated with each entry:

  dns_num_query_type_A{env="prod", loc="ams", shost="res409", region="eu", origin-ip="10.3.2.0/24", origin-as="1239", transit-as="2914"}

env will have <10 values ("prod", "dev", "test", and a few more TBD)
loc will have ~180 values (three-letter city codes)
shost will have up to ~3,000 label values (six-character alphanumeric hostnames)
region will have ~8 values (global area, two-letter codes)
origin-ip will have up to ~700,000 label values, each a CIDR-notation network (e.g. 10.3.2.0/24)
origin-as will have up to ~60,000 label values (int64)
transit-as will have up to ~3,000 label values (int64)
My question is: what problems am I going to hit with Prometheus using such large label dimensions? Will this work at all?

Obviously, I'm concerned about the "origin-ip" and "origin-as" labels having so many possible values. There are a lot of devils in the details here which make the labels a lot less scary than they look (the set is not fully expanded dimensionally), but it's still a really big number.

I've read the warning (https://prometheus.io/docs/practices/naming/) about high cardinality in labels. I don't see any way to get away from it in this case. It's also not exactly "unbounded", since there is a maximum limit on each of the values, and the churn on many of the labels is extremely low.

Why are we doing this? The operational reasons I will leave out of this discussion, since that is an internal issue and you'll have to take my word on it. These metrics are mostly used for "top-10" type queries in Grafana, in various ways, to help us locate high-volume peers or transit partners, identify DDoS attacks, build white/black lists, and feed some of our other in-house tools that balance/direct traffic based on activity. We wish to be able to perform queries like: "show me the top 10 transit peers in Amsterdam sending us A record requests in the last 24 hours" (see the query sketch below). As far as I can see, this is possible with Prometheus, but I'd like to hear whether this is actually a workable plan or whether I need to move to some other datastore for our short-term monitoring. We've experimented with some data so far, but I always like to get opinions from people with battle scars before I attack an unknown problem.

Note: we're only going to be logging data that we see, so if there is no traffic from a particular network, we will remove that network from the scraped set of data after some interval. If we never see data from a particular network, we'll never log a value for it from any of our servers. This should keep the set of data we're pulling from each DNS resolver to a reasonable number, so we're not pulling a fully expanded set of data from anywhere. Each server would be handing back (as a rough example) 1,200 different origin-ip label values, and origin-as and transit-as would never be more than one per origin-ip entry (because at any one time there can only be one origin AS and one transit AS for a network at any site, even though those values may change over time). I guess what I mean is that origin-as and transit-as are NOT additional dimensions; they're just informational tagging.

I'm also considering dropping "shost" from the label list, since all hosts within a particular location (POP) will exhibit the same characteristics from an origin-ip/origin-as/transit-as mapping perspective, so that label may be redundant, adding unnecessary time-series divisions. See below for how "loc" and "region" won't cause additional dimensionality either.

The collection interval would be maybe 5 minutes, maybe more, depending on the insertion speed we see on an NVMe-equipped machine.
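[Editor's note: for concreteness, here is a minimal PromQL sketch of that "top 10 transit peers in Amsterdam" query, assuming the A-record metric above is a counter and that the labels are exposed with underscores rather than hyphens (Prometheus label names may only contain letters, digits, and underscores, so "transit-as" would need to be exposed as "transit_as"):

  # Top 10 transit ASes by A-record query volume in Amsterdam over 24h.
  # "transit_as" is the underscore spelling of the transit-as label above.
  topk(10,
    sum by (transit_as) (
      increase(dns_num_query_type_A{loc="ams"}[24h])
    )
  )

With the qtype-label form discussed further down, the same query would read topk(10, sum by (transit_as) (increase(dns_num_query{qtype="A", loc="ams"}[24h]))).]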
A separate discussion, though related to labels:

We also considered adding another label and breaking the metrics up by query type, instead of storing each type as its own metric, but we're not sure that's wise. Currently we have a metric per query type, but I suppose the collectors could be changed to express the type as a label instead of a separate metric name. This would really cram a lot of dimensionality into the time series, so it makes me a bit wary. What's the preferred method, and why? There are about 40 different DNS record types we'll want to monitor, which makes me lean towards making this a label, as per the best practices, but I'm still not confident of the decision.

Current method:

  dns_num_query_type_A{env="prod", ...}
  dns_num_query_type_AAAA{env="prod", ...}
  ...

Possible method:

  dns_num_query{qtype="A", env="prod", ...}
  dns_num_query{qtype="AAAA", env="prod", ...}
  ...
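[Editor's note: a rough illustration of why the label form tends to be easier to query, assuming the qtype-label metric shape and underscored label names, and a range wide enough to hold several samples:

  # All record types together, per location:
  sum by (loc) (rate(dns_num_query[5m]))

  # Busiest record types in one location:
  topk(10, sum by (qtype) (rate(dns_num_query{loc="ams"}[5m])))

  # With one metric name per type, the same questions need a metric-name
  # regex such as {__name__=~"dns_num_query_type_.+"}, which is clumsier
  # to write and to keep in sync as new record types appear.
]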
Another label-based question:

I have two labels, "region" and "loc", which are tightly linked. A location is always going to be in the same region; the relationship is never going to change. However, I found that there was no easy way to do summary lookups in my Grafana queries if I didn't have "region" - otherwise, I was writing huge, ugly regexps that specified every loc contained in a region. Is there an easier way to do this that doesn't involve creating a label that exists only for my querying convenience? It seems wasteful.
Lastly: I'm aware of the pcap-based methods and existing databases that can capture this information; I'm specifically asking about Prometheus tagging, and not alternate DNS logging in general.

JT
On 15 December 2016 at 20:34, <john...@gmail.com> wrote:

> env will have <10 values ... origin-ip will have up to ~700,000 label
> values ... origin-as will have up to ~60,000 label values ...
> transit-as will have up to ~3,000 label values
> [snip]

That's a cross-product of 6.8e20, which is many orders of magnitude more than a single Prometheus server can handle, or than can be fit into a modern computer.

You should aim to keep the total number of time series in a Prometheus server no higher than the tens of millions, as it tends to start running into difficulty around that point. Queries that touch more than about 1-10k time series can also be slow, depending on hardware.
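[Editor's note: while experimenting, a couple of hedged PromQL probes can help keep an eye on how many series the collector is actually producing (metric and label names as above, underscored):

  # Current number of series behind one metric:
  count(dns_num_query)

  # How many distinct origin networks are actually being reported:
  count(count by (origin_ip) (dns_num_query))
]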
> Interval on collection would be maybe 5 minutes, maybe more, depending on
> the speed of insertions that we see on an NVMe-equipped machine.
> [snip]

The longest sane scrape interval in Prometheus at the moment is about 2 minutes, due to staleness handling.
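[Editor's note: the practical consequence, sketched with the qtype-label metric assumed above: rate() needs at least two samples inside its range, so with a 5-minute collection interval short ranges come back empty, and instant lookups sit right at the edge of the staleness window.

  # With a 5-minute scrape interval this window will usually hold only one
  # sample, so rate() returns nothing for the series:
  rate(dns_num_query[5m])

  # Widening the range to several scrape intervals works, at the cost of a
  # much smoother (coarser) rate:
  rate(dns_num_query[15m])
]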
> There are about 40 different DNS record types we'll want to monitor, which
> makes me lean towards making this a label, as per the best practices, but
> I'm still not confident of the decision.
> [snip]

That's still the same cardinality either way, and it's exactly where a label should be used.
> Is there an easier way to do this that doesn't involve creating a label
> that exists only for my querying convenience? It seems wasteful.

You should be able to use a variant of the approach described at https://www.robustperception.io/exposing-the-software-version-to-prometheus/ : create a time series per (region, loc) pair, and then limit your matches with "and on (loc) (region_locs{region='eu'})". However, given your high cardinality, having the additional label is best, as it avoids fetching data for all regions.
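[Editor's note: a sketch of that join, assuming a hypothetical info-style series region_locs (value 1, one series per loc, exposed by the collector or a small static exporter) and the qtype-label metric form with underscored label names:

  # region_locs{region="eu", loc="ams"}  1
  # Per-location A-record rate, restricted to locations in the "eu" region,
  # without carrying a region label on the data itself:
  sum by (loc) (rate(dns_num_query{qtype="A"}[5m]))
    and on (loc)
  region_locs{region="eu"}
]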
Brian

> Lastly: I'm aware of the pcap-based methods and existing databases that can
> capture this information; I'm specifically asking about Prometheus tagging,
> and not alternate DNS logging in general.
>
> JT

--
Brian Brazil