Collected reasons why Prometheus doesn't allow dot as a regular character in metric and label names

Bjoern Rabenstein

May 29, 2024, 9:05:04 AM
to Prometheus Developers
In a recent thread on this mailing list ("Limiting the blast radius of
OTel..."), several people once again suggested that Prometheus should
just allow the dot (`.`) as a regular character in metric and label
names and be done with it. I responded that we have discussed this
topic countless times, always with the result of not doing it
(yet). Of course, we are free to reopen the discussion as often as
anyone wishes (and in fact, one argument in the past was that we
should first introduce full UTF-8 capabilities via quoting and see how
it goes, and then we can still consider "graduating" selected
characters to regular characters that can be used without quoting).

However, the reason for this mail is that I also said that I won't
reiterate all the points made over and over again. After that, an
individual approached me and asked where they could read up about
those points, and I realized that they are hard to find in documented
form. (My vague memory was that I already wrote a mail like this in
the past, but I cannot find it anymore, and the relevant notes from
dev-summits are not detailed and structured enough to serve as a
reference.)

Therefore, I'll reiterate all those points one more time so that we
don't have to do it again in the future. Please amend this list if you
find any omissions. [In this list, I also tried to say something about
the relevance of each point. This is marked by square brackets.]

1. Probably the oldest reason is a plan for a short-form notation of
the job label. `requests_total{job="api"}` could be written as
`requests_total.api`. This originates from an ancient internal
Google practice. [I don't think that this point has any relevance
anymore. The job label is now considered way less special than
traditionally. Additionally, the short form would only work if the
value of the job label follows the same character restrictions as
names, which would cause confusion for sure when it doesn't.]

2. In the early years of Prometheus, the statsd/Graphite stack was
very relevant. Dots play a very special role there. In contrast,
even if we had allowed dots in Prometheus names from the beginning,
they would just have been characters like all the
others. Superficially, it would have looked like better
interoperability, but it would not have lived up to its implied
promises, because Graphite-style globbing would not have worked,
the metrics would not have had an actual hierarchy like in the
Graphite data model, etc. [This point is much weaker nowadays
because most users are probably more familiar with the
Prometheus-style label based data model than with the hierarchical
Graphite data model. I wouldn't expect much confusion because of
that. However, this point still illustrates the fundamental problem
of turning a character that is part of the actual syntax and
arguably even a real operator in one system into "just another
character" in an opaque string in the other system, where the
syntactic meaning only exists as a convention among humans. This is
also relevant for some of the other points below.]

3. Naming is a hard problem, as we all know. Many of the early
Prometheus contributors had rich experience with running complex
systems at scale. They all got burned by the fact that our brains
are really bad at remembering if something was named `foo-bar-baz`
or `foo_bar_baz` or `foo.bar.baz` or `foo/bar/baz` (or even
`foo_bar.baz`), especially in the heat of fighting an
outage. Following the "simple, light-weight, opinionated" paradigm
(once more many thanks to Julius for having expressed it so concisely
recently), Prometheus decided to have one and only one separator
character. In addition, this one separator character isn't really
special in a lot of languages, so names from the Prometheus
ecosystem would translate into names in other contexts easily
(initially and practically most relevant for Go templating, but the
idea works in a much wider scope). (One might come up with the
counter argument that Prometheus also allows `:` as a
separator. That's indeed a deviation from the fundamental idea. `:`
is meant only for rules, but that's just a convention and not
enforced by syntax. However, it has worked quite well for all those
years, presumably because people rarely use `:` as a separator
character by accident.) OTel semantic conventions are the
antithesis of this: They introduce two different separator
characters with a slightly different meaning (`.` for "namespaces",
but they aren't really namespaces, more about that below). And they
use a character that has a special meaning in a lot of
languages. (Coming back to the Go templating example:
`$labels.service_instance_id` is valid,
`$labels.service.instance.id` is not. It forces you to jump through
hoops and write `index $labels "service.instance.id"`. Similar
issues will occur in many other languages.) [Having only one
separator might appear to address just a minor annoyance, but in my
experience, it creates a great deal of peace of mind in the long
run. This is also a good example of why it is useful to mark `.` as
special by requiring the quoting syntax. If we allowed `.` as a
regular character, it would inevitably show up even in use cases
that are untouched by OTel's semantic conventions, defeating the
idea of "one and only one
separator character". In a way, the effort of quoting protects
regular Prometheus users from the `.` "pollution". Or in other
words: By allowing `.` as a regular character, we would make the
life of regular Prometheus users harder to accommodate OTel needs
originating from a questionable decision.]

4. Much more vague than (1), but there have been thoughts about
"proper" namespaces for a long time. The weird namespace concept in
client_golang is a witness from the distant past, but that
namespacing appears more like a joke in hindsight and never got
traction. By now, it has become more of an annoyance we want to get
rid of (but ironically, it is very similar to the "namespace"
concept of OTel's semantic conventions). What makes a namespace
"proper"? Maybe it's about the ability to be "inside" a namespace
so you don't have to add the prefix or suffix all the time. Or it's
about the namespaces to be indexed ("apply this query only to
metrics in that namespace"). But most importantly, a namespace must
come with an unambiguous syntax, which mostly boils down to having
a namespace operator. The most common namespace operator is
probably `.`, and that has been a good reason to reserve it in
Prometheus. OTel's semantic conventions claim to use `.` for
namespacing, too, but it's not an operator, it's just a
convention. Which leads to weird stipulations like this one
(https://opentelemetry.io/docs/specs/semconv/general/attribute-naming/):
"Names SHOULD NOT coincide with namespaces. For example if
service.instance.id is an attribute name then it is no longer valid
to have an attribute named service.instance because
service.instance is already a namespace. Because of this rule be
careful when choosing names: every existing name prohibits
existence of an equally named namespace in the future, and vice
versa: any existing namespace prohibits existence of an equally
named attribute key in the future." If `.` were a real namespace
operator, you simply would not have this problem. [It's obviously a
weak claim to block a feature in the present to keep open the
option for a vaguely planned feature in the future. Furthermore, we
could just use another character for the namespace
operator. (Although C++ style `::` wouldn't work because `:` is
already a regular character in names. And having a "weird"
namespace operator next to `.` as a regular character will be
confusing.) Still, I think proper namespacing would be so nice that
we shouldn't dismiss it easily. In this context, it's doubly
annoying that we cannot even just interpret the `.` coming from
OTel as a "true" namespace operator because OTel _also_ allows it
as a regular character. You can never know if a `.` coming from
OTel is meant as a namespace separator, so you would treat it as
such in your powerful namespace-enabled backend, or if it is just
part of the name.]

5. Prometheus has been plagued by magic suffixes from the very
beginning. In my understanding of Prometheus history, suffixing
components of a summary or a histogram with `_count`, `_sum`,
`_bucket` was a means to get an MVP running. I think it was a
mistake to reify this concept (originally constrained to TSDB and
PromQL) by letting it leak into the exposition format. OpenMetrics
made things worse by introducing more magic suffixes (`_info` got
introduced, and `_total` got a promotion from a mere recommendation
to another magic suffix, and arguably the same happened to the
unit). The problem with magic suffixes is namespace pollution: It
prevents usage of any of the magic suffixes in metric names. Or to
be precise: It's even worse, because the usage is not technically
forbidden. You can still do it, but then you might run into
surprising and confusing namespace collisions that might very well
show up in the worst of moments. A way out of this is to use a
separator character for the magic suffixes that is _not_ a valid
character in names otherwise. And now guess which character comes
to mind for that... [This is another point in the category "future
feature has a hard time blocking a feature proposal for the
present". This was concretely considered when OpenMetrics was
designed, but it got rejected by the OpenMetrics team. So there is
a non-zero chance that it will be on the table again when we try to
"fix" OpenMetrics. A counter point is again that another character
could be used, at the price of using something that is much less
intuitive to learn and read.]

6. Native histograms introduced the first instance of a "structured
metric", but more could happen in the future. Accessing "fields" in
this structure is currently done by bespoke functions in PromQL
(`histogram_count(request_latency_seconds)`,
`histogram_sum(request_latency_seconds)`), but it would be a quite
obvious alternative to allow something like
`request_latency_seconds.count` and `request_latency_seconds.sum`,
which is actually more than just syntactic sugar because we could
implement "field access" as a different thing from "function
call". The latter is an evaluation, it changes the timestamp, and
it cannot be used in a range selector
(`histogram_count(request_latency_seconds)[5m]` is invalid syntax),
while we could make `request_latency_seconds.count[5m]` valid
without changing the language fundamentals. [This is much more
tangible than (4) and (5), but not decided yet, so it still asks us
to reject a concrete feature request in the name of a possible
future feature. And again, there are other ways of implementing
this, avoiding the usage of a `.` operator, at the price of not
doing the most obvious thing.]
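
To make the Go templating example from (3) concrete, here is a minimal
sketch using Go's standard `text/template` package. The `render`
helper and the label values are hypothetical, and the
`missingkey=zero` option is an assumption meant to approximate how
Prometheus sets up its templates; none of this is taken from the
Prometheus code base.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// render executes a template body against a flat label map. It binds the
// map to $labels so the body can use the same variable name as in the mail.
// (Hypothetical helper; "missingkey=zero" is an assumption, see above.)
func render(body string, labels map[string]string) (string, error) {
	tmpl, err := template.New("t").
		Option("missingkey=zero").
		Parse(`{{ $labels := . }}` + body)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, labels); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	labels := map[string]string{
		"service_instance_id": "a1",
		"service.instance.id": "b2",
	}
	bodies := []string{
		// Underscore name: plain field-style access finds the label.
		`{{ $labels.service_instance_id }}`,
		// Dotted name: parsed as a chain of lookups ("service", then
		// "instance", then "id"), so the label is not found.
		`{{ $labels.service.instance.id }}`,
		// Dotted name via index: works, but needs the longer form.
		`{{ index $labels "service.instance.id" }}`,
	}
	for _, body := range bodies {
		out, err := render(body, labels)
		fmt.Printf("%s => %q (err: %v)\n", body, out, err)
	}
}
```

The point of the sketch is only that the dotted spelling is read by
the template language as nested access rather than as one name, which
is why the `index` form becomes necessary.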

In summary, (4), (5), and (6) are all more or less vague, but they are
close to my heart, as I'm continuously thinking about future
improvements of PromQL in particular and Prometheus in general. I
should also note that it isn't clear if all three ideas can be
combined or if they are actually mutually exclusive. So the argument
is not so much "dots will kill three possible features at the same
time", but more like "even though those ideas are somewhat vague and
possibly mutually exclusive, there are so many of them that it's
likely we will implement at least one of them in the not too distant
future". (1) is IMHO irrelevant. (2) nicely illustrates a bigger
fundamental problem, but the concrete reference to Graphite is mostly
historical. Which leaves us with (3) as the most generally applicable
argument to be made, at least if we cut out the visionary part (that
might just live in my head).
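
The namespace collision described in (5) can be sketched in a few
lines of Go. This is an illustration under stated assumptions, not how
Prometheus or OpenMetrics parse names: `splitUnderscore` and
`splitReserved` are hypothetical helpers, `queue_item_count` is a
made-up metric name, and `.` merely stands in for "a separator that is
not a valid character in names otherwise".

```go
package main

import (
	"fmt"
	"strings"
)

// magicSuffixes lists the histogram/summary suffixes named in the mail.
var magicSuffixes = []string{"_count", "_sum", "_bucket"}

// splitUnderscore guesses (base, suffix) from a series name. The guess is
// ambiguous because "_" is also a regular character in names.
func splitUnderscore(series string) (base, suffix string, magic bool) {
	for _, s := range magicSuffixes {
		if strings.HasSuffix(series, s) && len(series) > len(s) {
			return strings.TrimSuffix(series, s), s, true
		}
	}
	return series, "", false
}

// splitReserved does the same with a hypothetical reserved separator that
// cannot occur inside a name, so the split is unambiguous.
func splitReserved(series string) (base, field string, magic bool) {
	return strings.Cut(series, ".")
}

func main() {
	// A plain gauge that happens to end in "_count" is mistaken for the
	// count component of a histogram or summary called "queue_item".
	fmt.Println(splitUnderscore("queue_item_count"))

	// With a reserved separator, the same name is left alone, and only
	// an explicit field reference is split.
	fmt.Println(splitReserved("queue_item_count"))
	fmt.Println(splitReserved("request_latency_seconds.count"))
}
```

Any other character that is invalid in names would serve equally well
in `splitReserved`; the sketch uses `.` only because that is the
character the argument is about.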

As the final point, I would like to circle back to the very beginning
of the mail: The last consensus on `.` was that we want to implement
the full UTF-8 support via quoting first and see how it plays out "in
the wild". Only then can we see how well it really works (or how
badly), and based on that, we can make a much better trade-off about
the damage and benefits of introducing `.` as a regular character. I
would propose to do exactly that and wait for a bit longer (hopefully
just a few months) rather than pushing for a decision now.

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in