Remote Write Stopped

237 views
Skip to first unread message

Ramachandra Bhaskar Ayalavarapu

unread,
Jan 21, 2022, 7:15:14 AM1/21/22
to Prometheus Users
Hello

Some of our prometheus instances running inside kubernetes are frequently not sending metrics to the remote_write destination. the remote_write destination is healthy and doesn't have problems from other clusters. The container is stable and not crashing and neither we are seeing any errors. please advise.

prometheus version: v2.32.0

RAM given to container: 12-48gb cpu: 2-12 cores. Kubernetes has close to 130+ nodes

remoteWrite:
        - url: "https://<redacted>/api/v1/write"
          queueConfig:
          capacity: 10000
          batchSendDeadline: 5s
          minShards: 1
          maxSamplesPerSend: 1000
          maxShards: 10
          basicAuth:
            username:
              name: kuberemotewrite
              key: username
            password:
              name: kuberemotewrite
              key: password

Please advise on how to debug further.


Julien Pivotto

unread,
Jan 21, 2022, 7:19:06 AM1/21/22
to Ramachandra Bhaskar Ayalavarapu, Prometheus Users
Hello,

Are you using the "Prometheus agent" feature? What are the command line
flags for Prometheus?

Regards,
> --
> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAHcJzTnQj9UvrG9JTEQJVpDsNRBYiiRtqn5etUdA7DOrE18uvA%40mail.gmail.com.

--
Julien Pivotto
@roidelapluie

Ramachandra Bhaskar Ayalavarapu

unread,
Jan 21, 2022, 7:23:29 AM1/21/22
to Ramachandra Bhaskar Ayalavarapu, Prometheus Users
We are not using agent. We are using as Prometheus service run by operator

Ramachandra Bhaskar Ayalavarapu

unread,
Jan 21, 2022, 12:30:02 PM1/21/22
to Ramachandra Bhaskar Ayalavarapu, Prometheus Users
we are not using as agent

On Fri, 21 Jan 2022 at 17:49, Julien Pivotto <roidel...@prometheus.io> wrote:

Ramachandra Bhaskar Ayalavarapu

unread,
Jan 23, 2022, 1:23:51 AM1/23/22
to Ramachandra Bhaskar Ayalavarapu, Prometheus Users
Please advise what all i can do to prevent/investigate remote writes from getting stopped.

Ramachandra Bhaskar Ayalavarapu

unread,
Jan 24, 2022, 1:04:20 PM1/24/22
to Ramachandra Bhaskar Ayalavarapu, Prometheus Users
Hello

We are seeing remote_write is crashing frequently. Any suggestions ?

Prometheus Version: v.2.32.0
Cluster has close to 190 nodes . We are NOT Running in "Agent" mode.

 remoteWrite:
    - basicAuth:
        password:
          key: password
          name: kuberemotewrite
        username:
          key: username
          name: kuberemotewrite
      queueConfig:
        batchSendDeadline: 1s
        capacity: 100000
        maxBackoff: 30s
        maxSamplesPerSend: 5000
        maxShards: 300
        minBackoff: 1s
        minShards: 20

ts=2022-01-24T17:48:12.343Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Starting WAL watcher" queue=284bcb
ts=2022-01-24T17:48:12.343Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Starting scraped metadata watcher"
ts=2022-01-24T17:48:12.345Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Replaying WAL" queue=284bcb
ts=2022-01-24T17:48:12.575Z caller=main.go:1166 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=260.590898ms db_storage=1.7µs remote_storage=8.009608ms web_handler=900ns query_engine=1.5µs scrape=488.694µs scrape_sd=67.369525ms notify=27.199µs notify_sd=1.457283ms rules=162.536628ms
ts=2022-01-24T17:48:12.810Z caller=main.go:1166 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=235.052893ms db_storage=1.5µs remote_storage=185.498µs web_handler=600ns query_engine=1µs scrape=171.198µs scrape_sd=68.57841ms notify=16.5µs notify_sd=1.452883ms rules=141.813767ms
ts=2022-01-24T17:50:22.464Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Done replaying WAL" duration=2m10.120135725s
ts=2022-01-24T17:52:32.345Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote storage resharding" from=20 to=30
ts=2022-01-24T17:53:12.346Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote storage resharding" from=30 to=40
ts=2022-01-24T17:54:02.345Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote storage resharding" from=40 to=53
ts=2022-01-24T17:54:32.345Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1643046869 minSendTimestamp=1643046870
ts=2022-01-24T17:54:42.345Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1643046869 minSendTimestamp=1643046880
ts=2022-01-24T17:54:52.345Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1643046884 minSendTimestamp=1643046890
ts=2022-01-24T17:55:02.347Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://DUMMY.IO/api/v1/write\": context canceled"
ts=2022-01-24T17:55:02.347Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="non-recoverable error" count=5000 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:55:02.348Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="non-recoverable error" count=198 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:55:02.348Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed to flush all samples on shutdown" count=167431393
ts=2022-01-24T17:55:02.349Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Currently resharding, skipping."
ts=2022-01-24T17:55:12.345Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote storage resharding" from=53 to=90
ts=2022-01-24T17:55:52.344Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1643046945 minSendTimestamp=1643046950
ts=2022-01-24T17:56:12.344Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1643046966 minSendTimestamp=1643046970
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://DUMMY.IO/api/v1/write\": context canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="non-recoverable error" count=5000 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="non-recoverable error" count=4448 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="non-recoverable error" count=4862 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="non-recoverable error" count=4573 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed to flush all samples on shutdown" count=1234089315
ts=2022-01-24T17:56:22.344Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1643046979 minSendTimestamp=1643046980
ts=2022-01-24T17:56:32.344Z caller=dedupe.go:112 component=remote level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote storage resharding" from=90 to=209
ts=2022-01-24T17:57:32.346Z caller=dedupe.go:112 component=remote level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://DUMMY.IO/api/v1/write\": context canceled"
ts=2022-01-24T17:57:32.346Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="non-recoverable error" count=5000 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:57:32.347Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="non-recoverable error" count=4546 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:57:32.349Z caller=dedupe.go:112 component=remote level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed to flush all samples on shutdown" count=2884134481
Reply all
Reply to author
Forward
0 new messages