Remote Write Metadata propagation

Rob Skillington

Jul 16, 2020, 3:28:05 PM
to prometheus...@googlegroups.com
Firstly: Thanks a lot for sharing the dev summit notes, they are greatly appreciated. Also thank you for a great PromCon!

Regarding the Prometheus remote write metadata propagation consensus, are there any plans/projects/collaborations to work on a protocol that might help others in the ecosystem offer the same benefits to Prometheus ecosystem projects that operate on a per-write-request basis (i.e. stateless processing of a write request)?

I understand https://github.com/prometheus/prometheus/pull/6815 unblocks feature development on top of Prometheus for users with specific architectures; however, it is a non-starter for a lot of other projects, especially for third-party exporters to systems that are not owned by end users (e.g. for a remote write endpoint that targets StackDriver, the community is unable to change the implementation of StackDriver itself to cache metrics metadata or otherwise make it statefully available at ingestion time).

Obviously I have a vested interest here: as a remote write target, M3 has several stateless components before TSDB ingestion, and flowing the entire metadata set to a distributed set of DB nodes that each own a different portion of the metrics space has implications for M3 itself too (i.e. it is non-trivial to map metric name -> DB node without some messy stateful cache sitting somewhere in the architecture, which adds operational burden for end users).

I suppose what I'm asking is: are maintainers open to a community request that duplicates some of https://github.com/prometheus/prometheus/pull/6815 but sends just metric TYPE and UNIT per datapoint (which would need to be captured by the WAL if the feature is enabled) to a backend, so it can be processed correctly without needing a sync of a global set of metadata to the backend?

And if not, what are the plans here, and how can we collaborate to make this data useful to other consumers in the Prometheus ecosystem?

Best intentions,
Rob

Rob Skillington

Jul 16, 2020, 3:43:19 PM
to prometheus...@googlegroups.com
Typo: "community request" should be: "community contribution that duplicates some of PR 6815"

Chris Marchbanks

Jul 16, 2020, 4:39:22 PM
to Rob Skillington, Prometheus Developers
Hi Rob,

I would also like metadata to become stateless, and I view 6815 only as a first step and the start of an output format. Currently there is a work-in-progress design doc, and another topic for an upcoming dev summit, about allowing use cases where metadata needs to be in the same request as the samples.

Generally, I (and some others I have talked to) don't want to send all the metadata with every sample as that is very repetitive, especially for histograms and metrics with many series. Instead, I would like remote write requests to become transaction-based, at which point all the metadata from that scrape/transaction can be added to the metadata field introduced to the proto in 6815, and each sample can be linked to a metadata entry without as much duplication. That is very broad strokes, and I am sure it will be refined or changed completely with more usage.

That said, TYPE and UNIT are much smaller than metric name and help text, and I would support adding those to a linked metadata entry before remote write becomes transactional. Would that satisfy your use cases?
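
For illustration, here is a rough sketch of the linked-metadata shape described above, with samples referencing a deduplicated metadata entry by index rather than repeating TYPE/HELP/UNIT per sample. The types are hypothetical and are not the actual remote write proto:

```
package main

import "fmt"

// Hypothetical, illustrative types only - not the actual remote write proto.
type Metadata struct {
	MetricFamily string
	Type         string // e.g. "counter", "histogram"
	Unit         string
	Help         string
}

type Sample struct {
	Labels      map[string]string
	Value       float64
	TimestampMs int64
	MetadataRef int // index into WriteRequest.Metadata, shared by all series of a family
}

type WriteRequest struct {
	Metadata []Metadata // one entry per metric family in this scrape/transaction
	Samples  []Sample   // each sample points at its metadata entry by index
}

func main() {
	req := WriteRequest{
		Metadata: []Metadata{
			{MetricFamily: "http_requests_total", Type: "counter", Help: "Total HTTP requests."},
		},
		Samples: []Sample{
			{Labels: map[string]string{"__name__": "http_requests_total", "code": "200"}, Value: 1027, TimestampMs: 1596000000000, MetadataRef: 0},
			{Labels: map[string]string{"__name__": "http_requests_total", "code": "500"}, Value: 3, TimestampMs: 1596000000000, MetadataRef: 0},
		},
	}
	// Both samples share metadata entry 0, so TYPE/HELP are carried once per family.
	fmt.Println(req.Metadata[req.Samples[1].MetadataRef].Type)
}
```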

Chris


Rob Skillington

Jul 21, 2020, 5:49:30 PM
to Chris Marchbanks, Prometheus Developers
Hey Chris,

Apologies for the delay in responding.

Yes, I think that even just TYPE would be a great first step. I am working on a very small one-pager that outlines how we might get from here to the future you describe.

In terms of downstream processing, just having the TYPE on every single sample would be a huge step forward as it enables stateless processing of the metric (e.g. downsampling, and working out whether counter resets need to be detected when downsampling a single individual sample).
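
As a hedged illustration of that point, a minimal sketch of how a stateless downsampler could use a TYPE carried on the datapoint itself to handle counter resets within a window; the sample struct and field names are hypothetical, not anything from the PR:

```
package main

import "fmt"

// Hypothetical shape of a sample as it might arrive at a stateless processor,
// with TYPE attached to the datapoint itself.
type TypedSample struct {
	Metric string
	Type   string  // "counter", "gauge", ...
	Value  float64
	Prev   float64 // previous value seen within the current downsampling window
}

// downsampleDelta returns this sample's contribution to a rate-style
// downsampled aggregate, handling counter resets with no external metadata lookup.
func downsampleDelta(s TypedSample) float64 {
	switch s.Type {
	case "counter":
		if s.Value < s.Prev {
			// Counter reset: the counter restarted from zero.
			return s.Value
		}
		return s.Value - s.Prev
	default:
		// Gauges and untyped metrics are not cumulative; keep the raw value.
		return s.Value
	}
}

func main() {
	fmt.Println(downsampleDelta(TypedSample{Metric: "http_requests_total", Type: "counter", Prev: 950, Value: 12})) // reset detected -> 12
	fmt.Println(downsampleDelta(TypedSample{Metric: "queue_depth", Type: "gauge", Prev: 7, Value: 5}))              // gauge -> 5
}
```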

Also, you can imagine this enables suggesting certain functions to apply, e.g. auto-suggesting that rate(...) be applied, without needing to analyze the actual values of a time series or rely on best-effort heuristics.

Completely agreed that solving this for UNIT and HELP is more difficult, and that information would likely be much better sent/stored per metric name rather than per time-series sample.

I'll send out the Google doc for some comments shortly.

The transactional approach is interesting. It could be difficult given that this information can flap (i.e. start with some value for HELP/UNIT while a different target of the same application has a different value), which means ordering is important, and dealing with transactional ordering could be a hard problem. I agree that making this deterministic, if possible, would be great. Maybe it could be something like a token that is sent alongside the first remote write payload; if the continuation token the receiver sees indicates it missed some part of the stream, it can go and do a full sync and from then on receive updates/additions in a transactional way from the stream over remote write. Just a random thought though; it requires more exploration and different solutions being listed to weigh up pros/cons/complexity/etc.
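
Purely as a sketch of that random thought (nothing here is an agreed protocol, and all names are made up), the receiver-side bookkeeping might look like:

```
package main

import "fmt"

// Hypothetical receiver-side state for the continuation-token idea.
type metadataStream struct {
	lastToken uint64 // token seen with the previous write payload
	synced    bool
}

// handlePayload checks the token sent alongside a remote write payload; if the
// receiver missed part of the stream it falls back to a full metadata sync.
func (s *metadataStream) handlePayload(token uint64) {
	if s.synced && token == s.lastToken+1 {
		s.lastToken = token
		return // contiguous: apply the incremental metadata updates in this payload
	}
	// Gap detected (or first payload): perform a full metadata sync, then resume
	// consuming incremental updates from this token onwards.
	fmt.Println("gap detected, doing full metadata sync")
	s.lastToken = token
	s.synced = true
}

func main() {
	var s metadataStream
	s.handlePayload(10) // first payload -> full sync
	s.handlePayload(11) // contiguous -> incremental update
	s.handlePayload(15) // missed 12-14 -> full sync again
}
```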

Best,
Rob


Rob Skillington

Jul 21, 2020, 5:55:32 PM
to Chris Marchbanks, Prometheus Developers
Also want to point out that with just TYPE you can do things such as know a metric is a histogram and then suggest using "sum(rate(...)) by (le)" with a one-click button in a UI, which again is significantly harder without that information.
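
A sketch of the kind of one-click suggestion a UI could derive from TYPE alone; the mapping below is illustrative, not an official or exhaustive one:

```
package main

import "fmt"

// suggestQuery proposes a starting PromQL expression based only on the metric's
// TYPE - no inspection of sample values required. Illustrative mapping only.
func suggestQuery(metric, metricType string) string {
	switch metricType {
	case "counter":
		return fmt.Sprintf("rate(%s[5m])", metric)
	case "histogram":
		return fmt.Sprintf("histogram_quantile(0.99, sum(rate(%s_bucket[5m])) by (le))", metric)
	case "summary":
		return fmt.Sprintf("%s{quantile=\"0.99\"}", metric)
	default: // gauge / unknown: just show the raw series
		return metric
	}
}

func main() {
	fmt.Println(suggestQuery("http_request_duration_seconds", "histogram"))
	fmt.Println(suggestQuery("http_requests_total", "counter"))
}
```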

The reason it becomes important, though, is that some systems (e.g. StackDriver) require this schema/metric information the first time you record a sample. So you really want the very basics of it the first time you receive that sample (i.e. at least TYPE):

Defines a metric type and its schema. Once a metric descriptor is created, deleting or altering it stops data collection and makes the metric type's existing data unusable.
The following are specific rules for service defined Monitoring metric descriptors:
- type, metricKind, valueType and description fields are all required. The unit field must be specified if the valueType is any of DOUBLE, INT64, DISTRIBUTION.
- Maximum of default 500 metric descriptors per service is allowed.
- Maximum of default 10 labels per metric descriptor is allowed.

Just an example, but other systems, and definitely systems that want to do processing of metrics on the way in, would prefer that at the very least TYPE, and ideally UNIT too, is specified.
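
To make that concrete, a hedged sketch of a first-write path that must register a descriptor before accepting samples; the client interface is hypothetical and simply stands in for whatever the real backend (StackDriver or otherwise) requires:

```
package main

import "fmt"

// Hypothetical backend client: real systems expose their own descriptor-creation
// APIs; this interface just stands in for them.
type descriptorClient interface {
	CreateDescriptor(metric, metricType, unit string) error
}

type ingester struct {
	client descriptorClient
	seen   map[string]bool // metrics we have already registered a descriptor for
}

// ingest registers a descriptor the first time a metric name is seen, which is
// only possible if TYPE (and ideally UNIT) arrive with the sample itself.
func (i *ingester) ingest(metric, metricType, unit string, value float64) error {
	if !i.seen[metric] {
		if err := i.client.CreateDescriptor(metric, metricType, unit); err != nil {
			return err
		}
		i.seen[metric] = true
	}
	// ... write the datapoint itself ...
	return nil
}

type fakeClient struct{}

func (fakeClient) CreateDescriptor(metric, metricType, unit string) error {
	fmt.Printf("created descriptor %s type=%s unit=%s\n", metric, metricType, unit)
	return nil
}

func main() {
	ing := &ingester{client: fakeClient{}, seen: map[string]bool{}}
	_ = ing.ingest("http_requests_total", "counter", "", 1)
	_ = ing.ingest("http_requests_total", "counter", "", 2) // descriptor already exists, skip
}
```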

Rob Skillington

Aug 3, 2020, 3:04:50 AM
to Chris Marchbanks, Prometheus Developers
Ok - I have a proposal which could be broken up into two pieces: the first delivering TYPE per datapoint, the second delivering HELP and UNIT consistently and reliably once per unique metric name:
https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g

Would love to get some feedback on it. Thanks for the consideration. Is there anyone in particular I should reach out to directly for feedback?

Best,
Rob

Bjoern Rabenstein

Aug 6, 2020, 2:01:23 PM
to Rob Skillington, Chris Marchbanks, Prometheus Developers
On 03.08.20 03:04, Rob Skillington wrote:
> Ok - I have a proposal which could be broken up into two pieces, first
> delivering TYPE per datapoint, the second consistently and reliably HELP and
> UNIT once per unique metric name:
> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo
> /edit#heading=h.bik9uwphqy3g

Thanks for the doc. I have commented on it, but while doing so, I felt
the urge to comment more generally, which would not fit well into the
margin of a Google doc. My thoughts are also a bit out of scope of
Rob's design doc and more about the general topic of remote write and
the equally general topic of metadata (about which we have an ongoing
discussion among the Prometheus developers).

Disclaimer: I don't know the remote-write protocol very well. My hope
here is that my somewhat distant perspective is of some value as it
allows me to take a step back. However, I might just miss crucial details
that completely invalidate my thoughts. We'll see...

I do care a lot about metadata, though. (And ironically, the reason
why I declared remote write "somebody else's problem" is that I've
always disliked how it fundamentally ignores metadata.)

Rob's document embraces the fact that metadata can change over time,
but it assumes that at any given time, there is only one set of
metadata per unique metric name. It takes into account that there can
be drift, but it considers that an irregularity that will only happen
occasionally and iron itself out over time.

In practice, however, metadata can be legitimately and deliberately
different for different time series of the same name. Instrumentation
libraries and even the exposition format inherently require one set of
metadata per metric name, but this is all only enforced (and meant to
be enforced) _per target_. Once the samples are ingested (or even sent
onwards via remote write), they have no notion of what target they
came from. Furthermore, samples created by rule evaluation don't have
an originating target in the first place. (Which raises the question
of metadata for recording rules, which is another can of worms I'd
like to open eventually...)

(There is also the technical difficulty that the WAL has no notion of
bundling or referencing all the series with the same metric name. That
was commented about in the doc but is not my focus here.)

Rob's doc sees TYPE as special because it is so cheap to just add to
every data point. That's correct, but it's giving me an itch: Should
we really create different ways of handling metadata, depending on its
expected size?

Compare this with labels. There is no upper limit to their number or
size. Still, we have no plan of treating "large" labels differently
from "short" labels.

On top of that, we have by now gained the insight that metadata is
changing over time and essentially has to be tracked per series.

Or in other words: From a pure storage perspective, metadata behaves
exactly the same as labels! (There are certainly huge differences
semantically, but those only manifest themselves on the query level,
i.e. how you treat it in PromQL etc.)

(This is not exactly a new insight. This is more or less what I said
during the 2016 dev summit, when we first discussed remote write. But
I don't want to dwell on "told you so" moments... :o)

There is a good reason why we don't just add metadata as "pseudo
labels": As discussed a lot in the various design docs including Rob's
one, it would blow up the data size significantly because HELP strings
tend to be relatively long.

And that's the point where I would like to take a step back: We are
discussing to essentially treat something that is structurally the
same thing in three different ways: Way 1 for labels as we know
them. Way 2 for "small" metadata. Way 3 for "big" metadata.

However, while labels tend to be shorter than HELP strings, there is
the occasional use case with long or many labels. (Infamously, at
SoundCloud, a binary accidentally put a whole HTML page into a
label. That wasn't a use case, it was a bug, but the Prometheus server
ingesting that was just chugging along as if nothing special had
happened. It looked weird in the expression browser, though...) I'm
sure any vendor offering Prometheus remote storage as a service will
have a customer or two that use excessively long label names. If we
have to deal with that, why not bite the bullet and treat metadata in
the same way as labels in general? Or to phrase it in another way: Any
solution for "big" metadata could be used for labels, too, to
alleviate the pain with excessively long label names.

Or most succinctly: A robust and really good solution for
"big" metadata in remote write will make remote write much more
efficient if applied to labels, too.

Imagine an NALSD tech interview question that boils down to "design
Prometheus remote write". I bet that most of the better candidates
will recognize that most of the payload will consist of series
identifiers (call them labels or whatever) and they will suggest to
first transmit some kind of index and from then on only transmit short
series IDs. The best candidates will then find out about all the
problems with that: How to keep the protocol stateless, how to re-sync
the index, how to update it if new series arrive etc. Those are
certainly all good reasons why remote write as we know it does not
transfer an index of series IDs.

However, my point here is that we are now discussing exactly those
problems when we talk about metadata transmission. Let's solve those
problems and apply them to remote write in general!

Some thoughts about that:

Current remote write essentially transfers all labels for _every_
sample. This works reasonably well. Even if metadata blows up the data
size by 5x or 10x, transferring the whole index of metadata and labels
should remain feasible as long as we do it less frequently than once
every 10 samples. It's something that could be done each time a
remote-write receiver connects. From then on, we "only" have to track
when new series (or series with new metadata) show up and transfer
those. (I know it's not trivial, but we are already discussing
possible solutions in the various design docs.) Whenever a
remote-write receiver gets out of sync for some reason, it can simply
cut the connection and start with a complete re-sync again. As long as
that doesn't happen more often than once every 10 samples, we still
have a net gain. Combining this with sharding is another challenge,
but it doesn't appear unsolvable.
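
A minimal sketch of the receiver side of that index-plus-short-IDs idea, under the assumption of hypothetical index and sample message shapes (not a proposal for the actual wire format):

```
package main

import (
	"errors"
	"fmt"
)

// Hypothetical messages: the sender first transmits index entries mapping a
// short series ID to the full labels (and metadata), then sends samples that
// reference the ID only.
type IndexEntry struct {
	SeriesID uint64
	Labels   map[string]string
	Help     string
	Type     string
	Unit     string
}

type IDSample struct {
	SeriesID    uint64
	TimestampMs int64
	Value       float64
}

type receiver struct {
	index map[uint64]IndexEntry
}

var errUnknownSeries = errors.New("unknown series id: receiver must cut the connection and re-sync")

func (r *receiver) applyIndex(entries []IndexEntry) {
	for _, e := range entries {
		r.index[e.SeriesID] = e // new series, or an existing series with updated metadata
	}
}

func (r *receiver) applySample(s IDSample) error {
	entry, ok := r.index[s.SeriesID]
	if !ok {
		return errUnknownSeries // out of sync: a full re-sync is the recovery path
	}
	fmt.Printf("%v %s = %f @ %d\n", entry.Labels, entry.Type, s.Value, s.TimestampMs)
	return nil
}

func main() {
	r := &receiver{index: map[uint64]IndexEntry{}}
	r.applyIndex([]IndexEntry{{SeriesID: 1, Labels: map[string]string{"__name__": "up", "job": "node"}, Type: "gauge"}})
	_ = r.applySample(IDSample{SeriesID: 1, TimestampMs: 1596000000000, Value: 1})
	if err := r.applySample(IDSample{SeriesID: 2}); err != nil {
		fmt.Println(err)
	}
}
```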

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

Callum Styan

Aug 6, 2020, 3:42:55 PM
to Bjoern Rabenstein, Rob Skillington, Chris Marchbanks, Prometheus Developers
Thanks, Rob, for putting this proposal together; I think it highlights some features of what we want metadata RW and remote write in general to look like in the future. As others have pointed out (thanks, Björn, for giving such a detailed description), there are issues with the way Prometheus currently handles metadata that need to be thought about and handled differently when storing metadata in the WAL or in long-term storage. I didn't make many more comments as most of what I wanted to say had already been mentioned by others.

As part of thinking about how to get metadata and exemplars into remote write, some of us have been discussing what we've been calling 'the future of remote write'. While there's nothing formal yet, I will be starting a brainstorming/design doc soon and would appreciate your input there, Rob.

Rob Skillington

Aug 6, 2020, 5:58:58 PM
to Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
Hey Björn,


Thanks for the detailed response. I've had a few back and forths on this with
Brian and Chris over IRC and CNCF Slack now too.

I agree that fundamentally it seems naive to idealistically model this around
metric name alone. It needs to be per series, given what may happen w.r.t.
collisions across targets, etc.

Perhaps we can separate these discussions apart into two considerations:

1) Modeling of the data such that it is kept around for transmission (primarily
we're focused on WAL here).

2) Transmission (which, as you allude to, has many areas for improvement).

For (1) - it seems like this needs to be done per time series; thankfully we
have already modeled this so that per-series data is stored just once in a
single WAL file. I will write up my proposal here, but it will amount to
essentially encoding the HELP, UNIT and TYPE to the WAL per series, similar to
how labels for a series are encoded once per series in the WAL. Since this
optimization is in place, there's already a huge dampening effect on how
expensive it is to write out data about a series (e.g. labels). We can always
go and collect a sample WAL file and measure how much extra size with/without
HELP, UNIT and TYPE this would add, but it seems like it won't fundamentally 
change the order of magnitude in terms of "information about a timeseries 
storage size" vs "datapoints about a timeseries storage size". One extra change
would be re-encoding the series into the WAL if the HELP changed for that
series, just so that when HELP does change it can be up to date from the view
of whoever is reading the WAL (i.e. the Remote Write loop). Since this entry
needs to be loaded into memory for Remote Write today anyway, with string
interning as suggested by Chris, it won't change the memory profile
algorithmically of a Prometheus with Remote Write enabled. There will be some
overhead, at most likely similar in size to the label data, but we aren't
altering data structures (so the big-O magnitude of memory being used won't
change); we're adding fields to existing data structures, and string interning
should actually make it much less onerous since there is a large duplicative
effect with HELP among time series.
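
Roughly what (1) amounts to in code terms - a simplified, assumption-laden view of a per-series record carrying metadata, re-written only on change, with HELP strings interned. This is a sketch of the idea, not the actual WAL record encoding from the PR:

```
package main

import "fmt"

// Simplified view of what a WAL series record would carry per series under
// option (1): labels (as today) plus TYPE, UNIT and HELP, written once per
// series and re-written only when the metadata changes.
type seriesRecord struct {
	Ref    uint64
	Labels map[string]string
	Type   string
	Unit   string
	Help   string
}

// intern returns a canonical copy of a HELP string so series of the same
// metric family share one copy in memory rather than one per series.
var interned = map[string]string{}

func intern(s string) string {
	if v, ok := interned[s]; ok {
		return v
	}
	interned[s] = s
	return s
}

// maybeRewrite reports whether the series record needs to be appended to the
// WAL again because its metadata changed (so remote write readers see the update).
func maybeRewrite(existing *seriesRecord, newType, newUnit, newHelp string) bool {
	newHelp = intern(newHelp)
	if existing.Type == newType && existing.Unit == newUnit && existing.Help == newHelp {
		return false
	}
	existing.Type, existing.Unit, existing.Help = newType, newUnit, newHelp
	return true
}

func main() {
	rec := &seriesRecord{Ref: 1, Labels: map[string]string{"__name__": "up"}, Type: "gauge", Help: intern("Target up?")}
	fmt.Println(maybeRewrite(rec, "gauge", "", "Target up?"))            // false: nothing changed
	fmt.Println(maybeRewrite(rec, "gauge", "", "1 if the target is up")) // true: HELP changed, re-encode series record
}
```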

For (2) - we now have TYPE, HELP and UNIT all available for transmission if we
wanted to send them with every single datapoint. While I think we should
definitely examine HPACK-like compression features as you mentioned, Björn, I
think that kind of work is better separated into a Milestone 2 where it can be
considered properly. For the time being it's very plausible we could do some
negotiation with the receiving Remote Write endpoint by sending a "GET" to it
and seeing if it responds with a "capabilities + preferences" response:
whether it would like to receive metadata on every single request (and let
Snappy take care of keeping the size from ballooning too much), or TYPE on
every single datapoint with HELP and UNIT every DESIRED_SECONDS or so. To
enable a "send HELP every 10 minutes" feature we would have to add a
"last sent" timestamp to the data structure that holds the LABELS, TYPE, HELP
and UNIT for each series, to know when to resend to that backend, but that
seems entirely plausible and would not use more than 4 extra bytes.
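
The "last sent" bookkeeping in (2) would boil down to something like the following on the sending side (a sketch only; the field and function names are made up):

```
package main

import (
	"fmt"
	"time"
)

// Per-series state held by the sender under option (2): TYPE goes out with
// every datapoint, while HELP/UNIT are resent only after an interval.
type seriesState struct {
	Type         string
	Unit         string
	Help         string
	lastMetaSent time.Time // when HELP/UNIT were last sent to this backend (the "last sent" field above)
}

// includeHelpUnit decides whether this request should carry HELP/UNIT for the
// series, based on the receiver's desired resend interval.
func includeHelpUnit(s *seriesState, now time.Time, resendEvery time.Duration) bool {
	if now.Sub(s.lastMetaSent) < resendEvery {
		return false
	}
	s.lastMetaSent = now
	return true
}

func main() {
	s := &seriesState{Type: "counter", Help: "Total HTTP requests."}
	now := time.Now()
	fmt.Println(includeHelpUnit(s, now, 10*time.Minute))                     // true: never sent yet
	fmt.Println(includeHelpUnit(s, now.Add(1*time.Minute), 10*time.Minute))  // false: sent recently
	fmt.Println(includeHelpUnit(s, now.Add(11*time.Minute), 10*time.Minute)) // true: interval elapsed
}
```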

These thoughts are based on the discussions I've had and the thoughts on this
thread. What's the feedback on this before I go ahead and iterate on the design
to more closely map to what I'm suggesting here?

Best,
Rob

Rob Skillington

Aug 6, 2020, 6:04:07 PM
to Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
Hey Callum,

Apologies, I missed your response as I was typing back to Björn.

Look forward to seeing your document, sounds good. As I mentioned in my
previous email, I think there's definitely a "further work" area here. However,
I'd like to get at least TYPE (and, if it's not too difficult, HELP and UNIT
too) flowing sooner than that timeline, and we have folks ready to contribute
to work in this space right now.

Would love to hear your thoughts on my latest proposal as sent with the last
email.

Best,
Rob

Brian Brazil

Aug 7, 2020, 3:56:14 AM
to Rob Skillington, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
Negotiation is fundamentally stateful, as the process that receives the first request may be a very different one from the one that receives the second - such as if an upgrade is in progress. Remote write is intended to be a very simple thing that's easy to implement on the receiver end and is a send-only request-based protocol, so request-time negotiation is basically out. Any negotiation needs to happen via the config file, and even then it'd be better if nothing ever needed to be configured. Getting all the users of a remote write to change their config file or restart all their Prometheus servers is not an easy task after all.

Brian
 

Rob Skillington

Aug 7, 2020, 10:48:38 AM
to Brian Brazil, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
True - this could perhaps also be a blacklist in the config, so if you really
don't want increased egress you can optionally turn off sending TYPE, HELP and
UNIT, or send them at different frequencies. We could package some sensible
defaults so folks don't need to update their config.

The main intention is to enable these added features and make it possible for
various consumers to adjust some of these parameters if required, since
backends can be so different in their implementations. For M3 I would be
totally fine with the extra egress of receiving it on every single Remote Write
request; it should be mitigated fairly considerably by Snappy and the fact that
HELP is common across certain metric families.

Brian Brazil

Aug 7, 2020, 2:09:04 PM
to Rob Skillington, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers

That's really a micro-optimisation. If you are that worried about bandwidth you'd run a sidecar specific to your remote backend that was stateful and far more efficient overall. Sending the full label names and values on every request is going to be far more than the overhead of metadata on top of that, so I don't see a need as it stands for any of this to be configurable.

Brian

Rob Skillington

Aug 8, 2020, 9:38:10 AM
to Brian Brazil, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
Sounds good, I've updated the proposal with details on places in which changes
are required given the new approach:
https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#

Rob Skillington

Aug 8, 2020, 11:22:03 AM
to Brian Brazil, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
Here's a draft PR that propagates metadata to the WAL so that the WAL reader
can read it back:
https://github.com/robskillington/prometheus/pull/1/files

Would like a little bit of feedback on the datatypes and structure before
going further, if folks are open to that.

There's a few things not happening:
- Remote write queue manager does not use or send these extra fields yet.
- Head does not reset the "metadata" slice (not sure where "series" slice is
  reset in the head for pending series writes to WAL, want to do in same place).
- Metadata is not re-written on change yet.
- Tests.

Rob Skillington

Aug 10, 2020, 12:36:16 AM
to Brian Brazil, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
Update: The PR now sends the fields over remote write from the WAL and metadata
is also updated in the WAL when any field changes.

Now opened the PR against the primary repo:
https://github.com/prometheus/prometheus/pull/7771

I have tested this end-to-end with a modified M3 branch:
https://github.com/m3db/m3/compare/r/test-prometheus-metadata
> {... "msg":"received series","labels":"{__name__="prometheus_rule_group_...
> iterations_total",instance="localhost:9090",job="prometheus01",role=...
> "remote"}","type":"counter","unit":"","help":"The total number of scheduled...
> rule group evaluations, whether executed or missed."}

Tests still haven't been updated. Any feedback on the approach /
data structures would be greatly appreciated.

Would be good to know what others' thoughts are on next steps.

Callum Styan

Aug 10, 2020, 11:09:48 PM
to Rob Skillington, Brian Brazil, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
I'm hesitant to add anything that significantly increases the network bandwidth usage of remote write while at the same time not giving users a way to tune the usage to their needs.

I agree with Brian that we don't want the protocol itself to become stateful by introducing something like negotiation. I'd also prefer not to introduce multiple ways to do things, though I'm hoping we can find a way to accommodate your use case while not ballooning the average user's network egress bill.

I am fine with forcing the consuming end to be somewhat stateful like in the case of Josh's PR where all metadata is sent periodically and must be stored by the remote storage system.

Overall I'd like to see some numbers regarding current network bandwidth of remote write, remote write with metadata via Josh's PR, and remote write with sending metadata for every series in a remote write payload.

Rob, I'll review your PR tomorrow but it looks like Julien and Brian may already have that covered.

Brian Brazil

Aug 11, 2020, 6:05:14 AM
to Callum Styan, Rob Skillington, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
On Tue, 11 Aug 2020 at 04:09, Callum Styan <callu...@gmail.com> wrote:
Overall I'd like to see some numbers regarding current network bandwidth of remote write, remote write with metadata via Josh's PR, and remote write with sending metadata for every series in a remote write payload.

I agree, I noticed that in Rob's PR and had the same thought.

Brian

Julien Pivotto

Aug 11, 2020, 6:07:52 AM
to Brian Brazil, Callum Styan, Rob Skillington, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers

Remote bandwidth is likely to affect only people using remote write.

Getting a view on the on-disk size of the WAL would be great too, as
that will affect everyone.

--
Julien Pivotto
@roidelapluie

Brian Brazil

Aug 11, 2020, 6:15:33 AM
to Brian Brazil, Callum Styan, Rob Skillington, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
On Tue, 11 Aug 2020 at 11:07, Julien Pivotto <roidel...@prometheus.io> wrote:

Remote bandwidth is likely to affect only people using remote write.

Getting a view on the on-disk size of the WAL would be great too, as
that will affect everyone.

I'm not worried about that; it's only really written on series creation, so it won't be noticed unless you have really high levels of churn.

Brian

Rob Skillington

Aug 11, 2020, 11:55:18 AM
to Brian Brazil, Callum Styan, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
Agreed - I'll see what I can do about getting some numbers for a workload
collecting cAdvisor metrics; it seems to have a significant amount of HELP set:
https://github.com/google/cadvisor/blob/8450c56c21bc5406e2df79a2162806b9a23ebd34/metrics/testdata/prometheus_metrics

Rob Skillington

Aug 19, 2020, 4:20:14 AM
to Brian Brazil, Callum Styan, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
Here's the results from testing:
- node_exporter exporting 309 metrics each by turning on a lot of optional
  collectors, all have help set, very few have unit set
- running 8 on the host at 1s scrape interval, each with unique instance label
- steady state ~137kb/sec without this change
- steady state ~172kb/sec with this change
- roughly 30% increase

Graph here:
https://github.com/prometheus/prometheus/pull/7771#issuecomment-675923976

How do we want to proceed? This could be fairly close to the higher end of
the spectrum in terms of expected increase given the node_exporter metrics
density and fairly verbose metadata.

Even having said that, however, 30% is a fairly big increase and a relatively
large egress cost to have to swallow without any way to back out of this
behavior.

What do folks think of next steps?

Brian Brazil

Aug 19, 2020, 4:26:56 AM
to Rob Skillington, Callum Styan, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers

It is on the high end, however this is going to be among the worst cases as there's not going to be a lot of per-metric cardinality from the node exporter. I bet if you greatly increased the number of targets (and reduced the scrape interval to compensate) it'd be more reasonable. I think this is just about okay.

Brian

Rob Skillington

Aug 19, 2020, 4:47:26 AM
to Brian Brazil, Callum Styan, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
To add a bit more detail to that example, I was actually using a fairly tuned
remote write queue config that sent large batches, since the batch send
deadline was set to 1 minute along with a max samples per send of 5,000.
Here's that config:
```
remote_write:
  - url: http://localhost:3030/remote/write
    remote_timeout: 30s
    queue_config:
      capacity: 10000
      max_shards: 10
      min_shards: 3
      max_samples_per_send: 5000
      batch_send_deadline: 1m
      min_backoff: 50ms
      max_backoff: 1s
```

Using the default config we get worse utilization for both before/after numbers
but the delta/difference is less:
- steady state ~177kb/sec without this change
- steady state ~210kb/sec with this change
- roughly 20% increase

Using config:
```
remote_write:
  - url: http://localhost:3030/remote/write
    remote_timeout: 30s
```

Implicitly, the values for this config are:
- min shards 1
- max shards 1000
- max samples per send 100
- capacity 500
- batch send deadline 5s
- min backoff 30ms
- max backoff 100ms

Brian Brazil

Aug 19, 2020, 4:53:08 AM
to Rob Skillington, Callum Styan, Bjoern Rabenstein, Chris Marchbanks, Prometheus Developers
On Wed, 19 Aug 2020 at 09:47, Rob Skillington <r...@chronosphere.io> wrote:
Using the default config we get worse utilization for both before/after numbers
but the delta/difference is less:
- steady state ~177kb/sec without this change
- steady state ~210kb/sec with this change
- roughly 20% increase

I think 20% is okay all things considered.

Brian

Rob Skillington

Aug 19, 2020, 5:02:44 AM
to Brian Brazil, Bjoern Rabenstein, Callum Styan, Chris Marchbanks, Prometheus Developers, Rob Skillington
If anyone wants to do some further testing on their own datasets, it would definitely be interesting to see what range they fall in.

I’ll start addressing latest round of comments and tie up tests.
