Remote-write drop samples | design doc

Harkishen Singh

Feb 26, 2021, 11:56:17 AM
to Prometheus Developers
Hello everyone,

I have started working on #7912 and have written a design doc for it. Please share your suggestions, feedback, and improvements by commenting on the doc linked below.

Design doc link: click here

Thank you

Harkishen Singh

Tom Wilkie

Feb 27, 2021, 9:06:48 AM
to Harkishen Singh, Prometheus Developers
Hi Harkishen! Thank you for the doc - I'm really excited to see more interest in Prometheus remote write.

We can go back and forth on the doc with comments, but perhaps it would be easier to have a chat over VC with Chris and me? My main concern is that we preserve the lossless nature of remote write, and I worry that limiting the number of retries on 500s will undermine this.

Cheers

Tom

Harkishen Singh

Mar 1, 2021, 2:25:08 AM
to Prometheus Developers
Hi Tom,

I have tried to answer the comments. Please let me know whether the answers are satisfactory. I am happy to have a call if required (or if the discussion gets difficult).

I think the lossless nature can be controlled by the user through the config (limit_retries), giving users more control over whether they are willing to compromise a bit when retrying takes too long. If retrying goes on forever, I don't think that is helpful (the data will never be accepted by the remote storage). Also, as Chris mentioned, some users might prefer to have a few gaps and give higher priority to recent data, for example for alerting. So I think this approach gives more flexibility to the user while remaining optional (either by not setting it or by setting the retry count high enough).
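To make the idea concrete, here is a very rough Go sketch of what a retry cap could look like in the send path (all the names here, including limit_retries and sendBatch, are made up for illustration and are not existing Prometheus code):

// Rough sketch only: a send loop that gives up on a batch after a
// configurable number of retries instead of retrying forever.
// All names here are hypothetical, not actual Prometheus code.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errServerError = errors.New("remote storage returned 5xx")

// sendBatch stands in for the HTTP push to the remote storage.
// Here it always fails, to show the drop path.
func sendBatch() error {
	return errServerError
}

// sendWithRetryLimit retries a failed batch up to limitRetries times and
// then drops it, so newer samples are not blocked behind it forever.
// limitRetries == 0 would mean "retry forever" (today's behaviour).
func sendWithRetryLimit(limitRetries int, backoff time.Duration) {
	for attempt := 1; ; attempt++ {
		if err := sendBatch(); err == nil {
			return // accepted by the remote storage
		}
		if limitRetries > 0 && attempt >= limitRetries {
			fmt.Printf("dropping batch after %d attempts\n", attempt)
			return
		}
		time.Sleep(backoff) // real code would use exponential backoff
	}
}

func main() {
	sendWithRetryLimit(3, 10*time.Millisecond)
}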

(apologies if I am wrong somewhere)

Thank you
Harkishen Singh

Stuart Clark

Mar 1, 2021, 5:13:21 AM
to Harkishen Singh, Prometheus Developers
On 01/03/2021 07:25, Harkishen Singh wrote:
> Hi Tom,
>
> I have tried to answer the comments. Please let me know whether the
> answers are satisfactory. I am happy to have a call if required (or if
> the discussion gets difficult).
>
> I think the lossless nature can be controlled by the user through the
> config (limit_retries), giving users more control over whether they are
> willing to compromise a bit when retrying takes too long. If retrying
> goes on forever, I don't think that is helpful (the data will never be
> accepted by the remote storage). Also, as Chris mentioned, some users
> might prefer to have a few gaps and give higher priority to recent
> data, for example for alerting. So I think this approach gives more
> flexibility to the user while remaining optional (either by not setting
> it or by setting the retry count high enough).
>
Under what situations would retries happen forever?

If the receiver is available but cannot accept the data (for example due to metric size limits or the age of the samples), I would expect it to reject it with a 4XX code (permanent failure), which wouldn't trigger any retries.

Alternatively, if the receiver is unavailable or broken, it could result in "infinite" retries, but in that situation an age-based limit feels better than a retry limit - a short retry limit will drop samples that have just been scraped just as quickly as samples that are days old. Some systems have restrictions on what age can be ingested (e.g. Timestream), and administrators could decide that older data has no usefulness (e.g. if the receiver is used for alerting or anomaly detection). While the receiver should still reject such old samples once it is working again, a time-based limit would at least reduce the network impact once the receiver is back online (no need to send tons of data that we know will be rejected).
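As a purely illustrative sketch (made-up types, not the actual remote-write code), the age-based check is just a filter over the batch before each send, with a zero limit keeping today's never-drop behaviour:

// Illustrative sketch of an age-based drop, not actual Prometheus code.
package main

import (
	"fmt"
	"time"
)

type sample struct {
	ts    time.Time
	value float64
}

// dropOldSamples keeps only the samples younger than maxAge.
// A zero maxAge means "never drop", i.e. the current lossless behaviour.
func dropOldSamples(batch []sample, maxAge time.Duration, now time.Time) []sample {
	if maxAge == 0 {
		return batch
	}
	kept := make([]sample, 0, len(batch))
	for _, s := range batch {
		if now.Sub(s.ts) <= maxAge {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	now := time.Now()
	batch := []sample{
		{ts: now.Add(-2 * time.Hour), value: 1},   // hours old: dropped with a 30m limit
		{ts: now.Add(-1 * time.Minute), value: 2}, // fresh: kept
	}
	fmt.Println(len(dropOldSamples(batch, 30*time.Minute, now))) // prints 1
}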

--
Stuart Clark

Harkishen Singh

Mar 1, 2021, 5:31:04 AM
to Prometheus Developers
Hey Stuart,

Thank you for your suggestion.

Yes, I think an age-based limit can be implemented as well. I think we should keep both a max-retry limit and an age limit. The age limit would be helpful for time-based remote storages, while the retry-count limit can be used in general (for storage systems that are not time-based) and would help in situations where there is too much congestion in the network.

Ben Kochie

Mar 1, 2021, 5:32:37 AM
to Stuart Clark, Harkishen Singh, Prometheus Developers
If a remote write receiver is unable to ingest, wouldn't this be something to fix on the receiver side? The receiver could have a policy where it drops data rather than returning an error.

This way Prometheus sends, but doesn't need to know about or deal with ingestion policies. It sends a bit more data over the wire, but that part is cheap compared to the ingestion costs.

Chris Marchbanks

Mar 1, 2021, 8:02:02 PM
to Ben Kochie, Stuart Clark, Harkishen Singh, Prometheus Developers
Harkishen, thank you very much for the design document!

My initial thoughts are to agree with Stuart (as well as some users in the linked GitHub issue) that it makes the most sense to start with dropping data that is older than some configured age, with the default being to never drop data. For most outage scenarios I think this is the easiest to understand, and if there is an outage, retrying old data x times still does not help you much.

There are a couple of use cases that an age-based solution doesn't solve ideally:
1. An issue where bad data is causing the upstream system to break, e.g. I have seen a system return a 5xx due to a null byte in a label value causing some sort of panic. This blocks Prometheus from being able to process any samples newer than that bad sample. Yes, this is an issue with the remote storage, but it sucks when it happens, and it would be nice to have an easy workaround while a fix goes into the remote system. In this scenario, only dropping old data still means you wouldn't be sending anything new for quite a while, and if the bad data is persistent you would likely just end up 10 minutes to an hour behind permanently (whatever you set the age to be).
2. Retrying 429 errors is a new feature currently behind a flag, but it could make sense to retry 429s only a couple of times (if you want to retry them at all) and then drop the data so that non-rate-limited requests can proceed in the future.

I think, to start with, the above limitations are fine and the age-based system is probably the way to go. I also wonder if it is worth defining a more generic "retry_policies" section of remote write that could contain different options for 5xx vs 429.
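Purely as a sketch of the shape such a section could take (these field names don't exist today, and I'm not attached to this exact layout):

// Hypothetical "retry_policies" shape, sketched as Go structs with YAML
// tags. Field names are invented for illustration; this is not an
// existing Prometheus configuration.
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

// retryPolicy is a hypothetical per-status-class policy.
type retryPolicy struct {
	MaxRetries   int    `yaml:"max_retries,omitempty"`    // 0 = retry forever (current behaviour)
	MaxSampleAge string `yaml:"max_sample_age,omitempty"` // e.g. "30m"; empty = never drop by age
}

// remoteWriteConfig shows only the parts relevant to this sketch.
type remoteWriteConfig struct {
	URL           string                 `yaml:"url"`
	RetryPolicies map[string]retryPolicy `yaml:"retry_policies,omitempty"`
}

const example = `
url: http://remote-storage.example/api/v1/write
retry_policies:
  server_error:   # how to handle 5xx responses
    max_sample_age: 1h
  rate_limited:   # how to handle 429 responses
    max_retries: 3
`

func main() {
	var cfg remoteWriteConfig
	if err := yaml.Unmarshal([]byte(example), &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}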

On Mon, Mar 1, 2021 at 3:32 AM Ben Kochie <sup...@gmail.com> wrote:
If a remote write receiver is unable to ingest, wouldn't this be something to fix on the receiver side? The receiver could have a policy where it drops data rather than returning an error.

This way Prometheus sends, but doesn't need to know about or deal with ingestion policies. It sends a bit more data over the wire, but that part is cheap compared to the ingestion costs.

I certainly see the argument that this could all be cast as a receiver-side issue, but I have also personally experienced outages that were much harder to recover from due to a thundering-herd scenario once the service was restored, e.g. Cortex distributors (where an ingestion policy would be implemented) effectively locking up or OOMing at a high enough request rate. Also, an administrator may not be able to update whatever remote storage solution they use. This becomes even more painful in a resource-constrained environment. The solution right now is to go restart all of your Prometheus instances to indiscriminately drop data; I would prefer to be intentional about what data is dropped.

I would certainly be happy to jump on a call sometime with interested parties if that would be more efficient :)

Chris

Harkishen Singh

Mar 22, 2021, 7:47:10 AM
to Prometheus Developers
Thank you everyone for the suggestions!

I agree with the age-based solution, but such a solution is particularly useful for systems that have a limitation on time, and many don't have that. Given that, can we have both? If users have a remote-storage system that cares about time, they can use the time-based dropping logic. If they have a remote storage that can accept a sample with any timestamp (past or future), they can use the retry-count method. This would also avoid getting stuck on recurring errors, like the null-byte case.

We can have something like a LimitRetryPolicy that is either time or retries. If it is time, the maximum age is taken as input. If the policy is retries, then a count is taken as input for the maximum number of retries. That way, we solve both problems and leave it up to the user to decide, based on the storage system they are using.
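As a rough sketch of that decision logic (again, all names are made up for illustration and none of this is existing Prometheus code):

// Hypothetical LimitRetryPolicy: one knob that is either time based or
// retry-count based. Purely illustrative, not actual Prometheus code.
package main

import (
	"fmt"
	"time"
)

type policyKind int

const (
	policyNone    policyKind = iota // current behaviour: never drop
	policyTime                      // drop samples older than maxAge
	policyRetries                   // drop a batch after maxRetries attempts
)

type limitRetryPolicy struct {
	kind       policyKind
	maxAge     time.Duration // used when kind == policyTime
	maxRetries int           // used when kind == policyRetries
}

// shouldDrop decides whether a failed batch is dropped instead of retried,
// given how many attempts have been made and how old its oldest sample is.
func (p limitRetryPolicy) shouldDrop(attempts int, oldest, now time.Time) bool {
	switch p.kind {
	case policyTime:
		return now.Sub(oldest) > p.maxAge
	case policyRetries:
		return attempts >= p.maxRetries
	default:
		return false // never drop
	}
}

func main() {
	p := limitRetryPolicy{kind: policyRetries, maxRetries: 5}
	fmt.Println(p.shouldDrop(6, time.Now(), time.Now())) // true: drop after 5 attempts
}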

Does that look good to go, or should we do just the age-based way?

Thank you

Stuart Clark

Mar 22, 2021, 11:03:14 AM
to Harkishen Singh, Prometheus Developers
On 2021-03-22 11:47, Harkishen Singh wrote:
> Thank you everyone for the suggestions!
>
> I agree with the age-based solution, but such a solution is
> particularly useful for systems that have a limitation on time, and
> many don't have that. Given that, can we have both? If users have a
> remote-storage system that cares about time, they can use the
> time-based dropping logic. If they have a remote storage that can
> accept a sample with any timestamp (past or future), they can use the
> retry-count method. This would also avoid getting stuck on recurring
> errors, like the null-byte case.
>
> We can have something like a LimitRetryPolicy that is either time or
> retries. If it is time, the maximum age is taken as input. If the
> policy is retries, then a count is taken as input for the maximum
> number of retries. That way, we solve both problems and leave it up to
> the user to decide, based on the storage system they are using.
>
> Does that look good to go, or should we do just the age-based way?
>

The time-based limit isn't just about handling remote-write receivers that can only ingest samples up to a certain age, but also about encapsulating policy about what still matters.

Even if my receiver can ingest metrics from any time, it is quite possible that I don't care about data older than a certain period. For example, I might be doing something ML-related that can be used for auto-remediation, so I want all the data, but after 30 minutes it becomes irrelevant. So even though the receiver might accept older data, I can set the limit to 30 mins so Prometheus just drops it instead of trying to resend (possibly unblocking more recent data in the process).


--
Stuart Clark