RFC: Increasing Resistance to Remote Write Outages

Robert Fratto

Feb 16, 2021, 1:43:25 PM
to Prometheus Developers
Hi!

I've recently observed that the tolerance to a Remote Write outage is variable. The amount of lost data depends on the volume of existing data in WAL segments, which WAL segment is currently being tailed by Remote Write, and how much time remains before the next WAL checkpoint is created.
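
To make the dependency on the tailed segment concrete, here is a toy Go sketch. It assumes a checkpoint deletes every WAL segment whose index is below some cutoff. The function name `segmentsAtRisk`, the `cutoff` parameter, and that deletion rule are hypothetical simplifications for illustration, not Prometheus APIs or part of the proposal.

```go
package main

import "fmt"

// segmentsAtRisk is a hypothetical helper (not a Prometheus API).
// It returns how many WAL segments a lagging reader would lose if a
// checkpoint deleted every segment with an index below cutoff.
// tailed is the index of the segment the reader is currently on.
func segmentsAtRisk(tailed, cutoff int) int {
	if tailed >= cutoff {
		// The reader is already at or past the cutoff, so nothing
		// it still needs would be deleted.
		return 0
	}
	// Segments tailed..cutoff-1 would be deleted before being fully read.
	return cutoff - tailed
}

func main() {
	fmt.Println(segmentsAtRisk(3, 7)) // reader is behind the cutoff
	fmt.Println(segmentsAtRisk(8, 7)) // reader is already past the cutoff
}
```

In this simplified model, the further Remote Write falls behind during an outage, the more segments sit below the cutoff when the next checkpoint runs.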

I've written a proposal to try to address this and make tolerance to an outage more predictable. Feedback would be appreciated: https://docs.google.com/document/d/1DcaHoWZnA-N5UlQ7sJ0IlPKe4ul2nimWlzWaZSxPkNU/edit#

I plan to contribute the changes for this myself.

Best,
Robert

Harkishen Singh

Feb 17, 2021, 2:27:46 AM
to Prometheus Developers
Sorry if I am wrong, but won't transaction-based remote write solve this issue?

Robert Fratto

Feb 17, 2021, 7:51:10 AM
to Prometheus Developers
I'm not familiar with transaction-based remote write. Can you point me to where work on that is happening?

Robert Fratto

Feb 17, 2021, 7:51:58 AM
to Prometheus Developers
I've just finished adding an option 3 to the proposal that balances the tradeoffs between the previous two options: https://docs.google.com/document/d/1DcaHoWZnA-N5UlQ7sJ0IlPKe4ul2nimWlzWaZSxPkNU/edit?ts=602cd53c#heading=h.f1ppnnbncb4l

Julien Pivotto

Feb 17, 2021, 7:56:12 AM
to Robert Fratto, Prometheus Developers
On 17 Feb 04:51, Robert Fratto wrote:
> I'm not familiar with transaction based remote-write, can you point me at
> where work on that is happening?

There is no work on it at the moment, but the future goal is that the data
from a TSDB Commit() would be sent together, in the same remote write batch.

--
Julien Pivotto
@roidelapluie