Thanks for the doc. I have commented on it, but while doing so, I felt
the urge to comment more generally, which would not fit well into the
margin of a Google doc. My thoughts are also a bit outside the scope
of Rob's design doc and more about the general topic of remote write
and the equally general topic of metadata (about which we have an
ongoing discussion among the Prometheus developers).
Disclaimer: I don't know the remote-write protocol very well. My hope
here is that my somewhat distant perspective is of some value, as it
allows me to take a step back. However, I might just be missing
crucial details that completely invalidate my thoughts. We'll see...
I do care a lot about metadata, though. (And ironically, the reason
why I declared remote write "somebody else's problem" is that I've
always disliked how it fundamentally ignores metadata.)
Rob's document embraces the fact that metadata can change over time,
but it assumes that at any given time, there is only one set of
metadata per unique metric name. It takes into account that there can
be drift, but it considers that an irregularity that will only happen
occasionally and iron itself out over time.
In practice, however, metadata can be legitimately and deliberately
different for different time series of the same name. Instrumentation
libraries and even the exposition format inherently require one set of
metadata per metric name, but this is all only enforced (and meant to
be enforced) _per target_. Once the samples are ingested (or even sent
onwards via remote write), they have no notion of what target they
came from. Furthermore, samples created by rule evaluation don't have
an originating target in the first place. (Which raises the question
of metadata for recording rules, which is another can of worms I'd
like to open eventually...)
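To make that concrete, here is a hypothetical example (targets,
metric, and HELP strings are all made up): two targets can
legitimately expose the same metric name with different metadata, and
each exposition is perfectly valid on its own:

    # Exposed by target A (job="api"):
    # HELP http_requests_total Total HTTP requests served by the API.
    # TYPE http_requests_total counter
    http_requests_total{path="/users"} 1027

    # Exposed by target B (job="proxy"):
    # HELP http_requests_total HTTP requests seen by the proxy, including retries.
    # TYPE http_requests_total counter
    http_requests_total{path="/users"} 4311

After ingestion, both series are just http_requests_total with
different labels, and there is no single "correct" HELP string for
that metric name anymore.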
(There is also the technical difficulty that the WAL has no notion of
bundling or referencing all the series with the same metric name. That
was commented on in the doc but is not my focus here.)
Rob's doc sees TYPE as special because it is so cheap to just add to
every data point. That's correct, but it's giving me an itch: Should
we really create different ways of handling metadata, depending on its
expected size?
Compare this with labels. There is no upper limit to their number or
size. Still, we have no plan to treat "large" labels differently from
"short" labels.
On top of that, we have by now gained the insight that metadata
changes over time and essentially has to be tracked per series.
Or in other words: From a pure storage perspective, metadata behaves
exactly the same as labels! (There are certainly huge differences
semantically, but those only manifest themselves on the query level,
i.e. how you treat it in PromQL etc.)
(This is not exactly a new insight. This is more or less what I said
during the 2016 dev summit, when we first discussed remote write. But
I don't want to dwell on "told you so" moments... :o)
There is a good reason why we don't just add metadata as "pseudo
labels": As discussed a lot in the various design docs, including
Rob's, it would blow up the data size significantly because HELP
strings tend to be relatively long.
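Just to illustrate the blow-up (the __help__ and __type__ label names
are made up, and so is the series): every single sample of

    http_requests_total{instance="a:9090", job="api", path="/users"}

would effectively become a sample of

    http_requests_total{__help__="Total HTTP requests served by the API.",
        __type__="counter", instance="a:9090", job="api", path="/users"}

and current remote write would transmit that full label set with every
sample, repeating the HELP string over and over.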
And that's the point where I would like to take a step back: We are
discussing essentially treating something that is structurally the
same thing in three different ways: Way 1 for labels as we know
them. Way 2 for "small" metadata. Way 3 for "big" metadata.
However, while labels tend to be shorter than HELP strings, there is
the occasional use case with long or many labels. (Infamously, at
SoundCloud, a binary accidentally put a whole HTML page into a
label. That wasn't a use case, it was a bug, but the Prometheus server
ingesting that was just chugging along as if nothing special had
happened. It looked weird in the expression browser, though...) I'm
sure any vendor offering Prometheus remote storage as a service will
have a customer or two that use excessively long label names. If we
have to deal with that, why not bite the bullet and treat metadata in
the same way as labels in general? Or to phrase it in another way: Any
solution for "big" metadata could be used for labels, too, to
alleviate the pain with excessively long label names.
Or most succinctly: A robust and really good solution for "big"
metadata in remote write will make remote write much more efficient if
applied to labels, too.
Imagine an NALSD tech interview question that boils down to "design
Prometheus remote write". I bet that most of the better candidates
will recognize that most of the payload will consist of series
identifiers (call them labels or whatever), and they will suggest
first transmitting some kind of index and from then on only
transmitting short series IDs. The best candidates will then find out
about all the problems with that: How to keep the protocol stateless,
how to re-sync the index, how to update it when new series arrive,
etc. Those are certainly all good reasons why remote write as we know
it does not transfer an index of series IDs.
However, my point here is that we are now discussing exactly those
problems when we talk about metadata transmission. Let's solve those
problems and apply the solutions to remote write in general!
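To sketch what I mean (in Go; all type and function names are made up
for illustration, and the wire format is deliberately left open):

    // Sender-side sketch of an indexed remote write: transmit the full
    // series descriptor (labels and metadata) once, then refer to the
    // series by a short integer ref. All names are invented; this is
    // not the actual remote-write protocol.
    package sketch

    type SeriesDesc struct {
        Labels map[string]string // full label set
        Help   string            // metadata travels in the index entry
        Type   string
    }

    type Sample struct {
        Ref       uint64 // points at a previously transmitted SeriesDesc
        Timestamp int64
        Value     float64
    }

    type Sender struct {
        nextRef uint64
        sent    map[string]uint64 // series key -> ref already on the wire
    }

    func NewSender() *Sender {
        return &Sender{sent: make(map[string]uint64)}
    }

    // Send transmits the expensive index entry only the first time a
    // given key is seen. The key must cover labels _and_ metadata, so
    // that a metadata change shows up as a new index entry. Every
    // later sample of the series only carries the cheap ref.
    func (s *Sender) Send(key string, desc SeriesDesc, ts int64, v float64) {
        ref, ok := s.sent[key]
        if !ok {
            ref = s.nextRef
            s.nextRef++
            s.sent[key] = ref
            transmitIndexEntry(ref, desc) // rare: new series or new metadata
        }
        transmitSample(Sample{Ref: ref, Timestamp: ts, Value: v}) // every sample
    }

    func transmitIndexEntry(ref uint64, desc SeriesDesc) { /* wire format TBD */ }
    func transmitSample(smp Sample)                      { /* wire format TBD */ }

The hard problems from the interview scenario all live around that
map: the receiver has to end up with the same ref assignment, and the
map has to survive (or be rebuilt after) restarts and re-connects.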
Some thoughts about that:
Current remote write essentially transfers all labels for _every_
sample. This works reasonably well. Even if metadata blows up the data
size by 5x or 10x, transferring the whole index of metadata and labels
should remain feasible as long as we do it less frequently than once
every 10 samples. It's something that could be done each time a
remote-write receiver connects. From then on, we "only" have to track
when new series (or series with new metadata) show up and transfer
those. (I know it's not trivial, but we are already discussing
possible solutions in the various design docs.) Whenever a
remote-write receiver gets out of sync for some reason, it can simply
cut the connection and start with a complete re-sync again. As long as
that doesn't happen more often than once every 10 samples, we still
have a net gain. Combining this with sharding is another challenge,
but it doesn't appear unsolvable.
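To spell out the arithmetic behind the "once every 10 samples" claim
(with made-up but plausible numbers): if the full label set of a
series costs L bytes on the wire, today's remote write pays about L
for every sample. If labels plus metadata come to 10L, and we transmit
that index entry once per N samples of a series while each sample only
carries a short ref, the per-sample cost is roughly 10L/N plus a few
bytes for the ref. That already undercuts the status quo for N > 10,
and a full re-sync every 10 samples would mean one index transfer per
series every 2.5 minutes at a 15s scrape interval, which sounds like a
comfortable budget.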
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email]
bjo...@rabenste.in