Disable remote write retry


Ruben Papovyan

Sep 12, 2020, 3:33:41 PM
to Prometheus Users
Hi team,
What are the options for disabling remote write retries?
Can I use the following config to disable them?
```
remote_write:
  queue_config: 
    min_backoff: 2h
    max_backoff: 2h
```
Or, if I need to retry 4 times, can I use this config?
```
remote_write:
  queue_config: 
    min_backoff: 30m
    max_backoff: 2h
```
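
As a rough check on what the second config would do: the Prometheus configuration docs describe `min_backoff` as the initial retry delay, doubled on every retry, and `max_backoff` as the maximum retry delay. The sketch below only reproduces that arithmetic. The function name `backoff_schedule` is hypothetical, and the sketch assumes nothing about how many retries Prometheus actually performs; as far as I can tell these two settings space retries out rather than cap their number.

```python
from datetime import timedelta


def backoff_schedule(min_backoff, max_backoff, attempts):
    """Delay before each of the first `attempts` retries.

    Starts at min_backoff, doubles after every retry, and is capped
    at max_backoff (the behaviour the queue_config docs describe).
    """
    delays = []
    delay = min_backoff
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, max_backoff)
    return delays


# min_backoff: 30m, max_backoff: 2h
schedule = backoff_schedule(timedelta(minutes=30), timedelta(hours=2), 5)
print([str(d) for d in schedule])
```

With these values the delay grows 30m, 1h, 2h and then stays at 2h for every later retry, so the config alone does not express "retry 4 times".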

What do you recommend?

My guess is that after 2h the WAL will be compacted and the data will not be resent. Is that right?
The motivation for this is that I had a network outage, after which Cortex would not accept metrics (sample timestamp out of order), and Prometheus ended up DDoSing Cortex.


Thank you,
Ruben


Bartłomiej Płotka

Sep 13, 2020, 2:52:12 AM
to Ruben Papovyan, Prometheus Users
Hey, 

Unless there is a bug on the receiving side (maybe your front proxy is masking the actual status code) or in Cortex, both Cortex and Thanos Receive, when they reject a write for a reason like this (one where there is no point in retrying), return a status code that tells Prometheus to drop those requests and not retry.
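
A simplified sketch of that contract, assuming the behaviour discussed in this thread (5xx responses are treated as recoverable and retried, other non-2xx responses such as 400 are dropped). The function name `classify_response` is hypothetical and this is not Prometheus's actual code; newer versions may treat additional status codes differently.

```python
def classify_response(status_code: int) -> str:
    """Return what the sender does with a batch after this HTTP status.

    Assumption for illustration: 2xx is success, 5xx is retried with
    backoff, and everything else (for example 400) is dropped.
    """
    if 200 <= status_code < 300:
        return "ok"
    if status_code // 100 == 5:
        return "retry"
    return "drop"


for code in (200, 400, 500):
    print(code, classify_response(code))
```

Under this assumption, an "out of order" rejection returned as 400 is dropped, while a timeout surfaced as 500 is retried.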

Kind Regards,
Bartek Płotka (@bwplotka)



Ruben Papovyan

Sep 13, 2020, 5:31:35 PM
to Prometheus Users
@bwplotka,
Thanks for your response.
I see both 400 and 500 errors in the Cortex distributor.
A 400 will NOT be sent again, but a 500 will be resent, and that caused the outage.

These are the two types of errors that I see in the distributor. There are no error logs in the ingesters (only 400 errors in the ingesters).

```
level=warn ts=2020-09-11T15:14:15.55091129Z caller=logging.go:62 traceID=1e80d0d72c7dfb18 msg="POST /api/prom/push (500) 11.40001159s Response: \"context canceled\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 74202; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.16.0; X-Forwarded-For: 10.254.178.57; X-Forwarded-Host: cortex.devops.app.umusic.net; X-Forwarded-Port: 80; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 10.254.178.57; X-Request-Id: aa786f8ba1483741acdcbb8503f9fb0d; X-Scheme: http; X-Scope-Orgid: eks-11; "
level=warn ts=2020-09-11T15:14:09.942532161Z caller=logging.go:62 traceID=69a628f39a21de24 msg="POST /api/prom/push (500) 6.100572749s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5908; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.13.1; X-Forwarded-For: 10.104.33.77; X-Forwarded-Host: cortex.devops.app.umusic.net; X-Forwarded-Port: 80; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 10.104.33.77; X-Request-Id: 3859a4b2f0e3b3badc281b95c9d7b852; X-Scheme: http; X-Scope-Orgid: eks-13; "
```

In the Prometheus log I see 400s, so the Cortex gateway is not hiding the real status code.
Prometheus logs:
ts=2020-09-11T15:32:05.667Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url=http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable error" count=361 err="context canceled"
ts=2020-09-11T15:32:05.667Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url=http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable error" count=60 err="context canceled"
ts=2020-09-11T15:32:05.635Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url=http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Failed to flush all samples on shutdown"
ts=2020-09-11T15:32:02.947Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url=http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable error" count=1000 err="server returned HTTP status 400 Bad Request: user=aws10-eks: sample timestamp out of order; last timestamp: 1599838222.874, incoming timestamp: 1599838162.874 for series {__name__=\"kube_pod_status_ready\", app_kubernetes_io_instance=\"kube-state-metrics\", app_kubernetes_io_managed_by=\"H"
ts=2020-09-11T15:32:02.665Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url=http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable error" count=1000 err="server returned HTTP status 400 Bad Request: user=aws10-eks: sample timestamp out of order; last timestamp: 1599838222.874, incoming timestamp: 1599838162.874 for series {__name__=\"kube_secret_info\", app_kubernetes_io_instance=\"kube-state-metrics\", app_kubernetes_io_managed_by=\"Helm\","
........
ts=2020-09-11T15:01:22.707Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url=http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Remote storage resharding" from=3 to=5
level=info ts=2020-09-11T15:00:08.014Z caller=head.go:731 component=tsdb msg="WAL checkpoint complete" first=232 last=234 duration=1.254153897s
level=info ts=2020-09-11T15:00:06.759Z caller=head.go:661 component=tsdb msg="head GC completed" duration=77.995686ms
level=info ts=2020-09-11T15:00:06.314Z caller=compact.go:496 component=tsdb msg="write block" mint=1599825600000 maxt=1599832800000 ulid=01EHYTWDPEX8SSCGBQT4PVCP95 duration=2.908463458s
ts=2020-09-11T14:36:42.706Z caller=dedupe.go:112 component=remote level=info remote_name=435af2 url=http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Remote storage resharding" from=2 to=3
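
To make the "sample timestamp out of order" messages easier to read, the sketch below converts the two Unix timestamps from those log lines to UTC and computes the gap between them. The helper names `to_millis` and `to_utc` are hypothetical; the only inputs are the two values copied from the logs.

```python
from datetime import datetime, timezone

# Values copied from the "sample timestamp out of order" log lines (Unix seconds).
last_ts = "1599838222.874"
incoming_ts = "1599838162.874"


def to_millis(ts: str) -> int:
    """Parse 'seconds.millis' into integer milliseconds (avoids float rounding)."""
    seconds, _, frac = ts.partition(".")
    return int(seconds) * 1000 + int(frac.ljust(3, "0")[:3])


def to_utc(ts: str) -> str:
    """Format a Unix timestamp string as an ISO-like UTC time with milliseconds."""
    ms = to_millis(ts)
    dt = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{ms % 1000:03d}Z"


gap_seconds = (to_millis(last_ts) - to_millis(incoming_ts)) / 1000
print(to_utc(last_ts), to_utc(incoming_ts), gap_seconds)
```

The incoming sample is exactly 60 seconds older than the last sample Cortex had already accepted for that series, which is why it is rejected with a 400.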

I will be troubleshooting the Cortex installation and configuration.

But I also want to increase the time between resend retries so I don't end up in the same situation.

What is the right value for 30 minutes in the Prometheus config? Is (min_backoff: 30m) right?

I'm open to any recommendations for Cortex (what could be misconfigured so that I'm getting the messages above in the distributor?).

Thank you,
Ruben