All samples lost when Prometheus server returns 500 to Prometheus agent


koly li

May 17, 2024, 5:02:49 AM
to Prometheus Users
Hello all,

Recently we found that all of our samples were lost. After some investigation, we found:
1. We are using a Prometheus agent to send all data to a Prometheus server via remote write.
2. The agent's sample-sending code is in storage/remote/queue_manager.go, in the function sendWriteRequestWithBackoff().
3. Inside that function, if the attempt() callback (which makes the request to the Prometheus server) returns a recoverable error, the agent retries sending the request.
4. When is a recoverable error returned? One scenario is when the Prometheus server returns a 5xx error.
5. I think not every 5xx error is actually recoverable, and there is no other way to exit the for loop in sendWriteRequestWithBackoff(). The agent kept retrying but received a 5xx from the server every time, so we lost all samples for hours until we restarted the agent.

So my question is: why is a 5xx from the Prometheus server considered recoverable? And I believe there should be a way to exit the loop, for example a maximum number of retries.
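
For context, here is a simplified sketch of the retry loop I am describing. It paraphrases sendWriteRequestWithBackoff() rather than copying the real storage/remote/queue_manager.go; the RecoverableError type, function signature and package name below are local stand-ins:

package remotesketch

import (
    "context"
    "errors"
    "time"
)

// RecoverableError is a local stand-in for the remote-write client's
// recoverable-error type: it marks an error the sender should retry.
type RecoverableError struct{ Err error }

func (e RecoverableError) Error() string { return e.Err.Error() }

// sendWithBackoff paraphrases sendWriteRequestWithBackoff(): the only ways
// out of the loop are success, a non-recoverable error, or context
// cancellation. There is no cap on the number of attempts, only on the
// backoff interval, so a server that keeps answering 5xx keeps the same
// batch in flight indefinitely.
func sendWithBackoff(ctx context.Context, attempt func(try int) error, minBackoff, maxBackoff time.Duration) error {
    backoff := minBackoff
    for try := 0; ; try++ {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }

        err := attempt(try)
        if err == nil {
            return nil
        }

        // A non-recoverable error (e.g. a 4xx response) ends the loop and
        // the batch is dropped.
        var re RecoverableError
        if !errors.As(err, &re) {
            return err
        }

        // Recoverable (e.g. a 5xx response): back off and resend the same batch.
        time.Sleep(backoff)
        backoff *= 2
        if backoff > maxBackoff {
            backoff = maxBackoff
        }
    }
}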

It seems that the agent mode is not mature enough to work in production.

Brian Candler

May 17, 2024, 8:15:04 AM
to Prometheus Users
It's difficult to make sense of what you're saying. Without seeing logs from both the agent and the server while this problem was occurring (e.g. `journalctl -eu prometheus`), it's hard to know what was really happening. Also you need to say what exact versions of prometheus and the agent were running.

The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?


> So my question is why 5xx from the promtheus server is considered Recoverable?

It is by definition of the HTTP protocol: https://datatracker.ietf.org/doc/html/rfc2616#section-10.5

Actually it depends on exactly which 5xx error code you're talking about, but common 500 and 503 errors are generally transient, meaning there was a problem at the server and the request may succeed if tried again later.  If the prometheus server wanted to tell the client that the request was invalid and could never possibly succeed, then it would return a 4xx error.

> And I believe there should be a way to exit the loop, for example a maximum times to  retry.

You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.

You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.

koly li

May 19, 2024, 11:52:03 PM
to Prometheus Users
Sorry for my poor description. Here is the story:

1) At first, we were using Prometheus v2.47.

Then we found that all metrics were missing, so we checked the Prometheus server and Prometheus agent logs:

Prometheus server log (lots of lines like these):
ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
....
ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"

prometheus agent logs:
ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
....
ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"

Then we checked the code.

A "too old sample" error is returned to the agent as a 500, and the agent keeps retrying (it exits the loop only when the error is not recoverable, and a 500 is considered recoverable).
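
A rough sketch of that client-side classification, paraphrased from the behaviour we see rather than copied from storage/remote/client.go (RecoverableError is the same kind of local stand-in as in my first message):

package clientsketch

import (
    "fmt"
    "io"
    "net/http"
)

// RecoverableError is a local stand-in for the client's recoverable-error type.
type RecoverableError struct{ Err error }

func (e RecoverableError) Error() string { return e.Err.Error() }

// classifyResponse paraphrases how the remote-write client turns an HTTP
// response into an error: any 5xx is wrapped as recoverable (and therefore
// retried by the backoff loop), while any other non-2xx is permanent and
// the batch is dropped.
func classifyResponse(resp *http.Response) error {
    if resp.StatusCode/100 == 2 {
        return nil
    }
    body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
    err := fmt.Errorf("server returned HTTP status %s: %s", resp.Status, body)
    if resp.StatusCode/100 == 5 {
        // In v2.47 "too old sample" arrived with a 500, so it landed here
        // and was retried forever.
        return RecoverableError{Err: err}
    }
    // 4xx: the request itself is considered invalid and will never succeed.
    return err
}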

> You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.
We guess that one or more samples with too-old timestamps caused the problem: they made the Prometheus agent retry forever (the agent kept receiving 500), which prevented new samples from being sent.

Because the server does not log the offending samples, we could not figure out which samples they were. It logs samples only for a few specific errors: https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L132

>  The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?
We restarted the agent 1~2 days after the problem occurred. After the restart, the data being sent no longer contained too-old samples, which is why the 500 errors disappeared.

2) Then we upgraded to v2.51.
The new version returns 400 for "too old sample" errors.
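
As far as we can tell, the handler maps a small fixed set of storage errors to 400 and lets everything else fall through to 500. Here is a hedged sketch of that shape; it is not the verbatim write_handler.go, the sentinel errors are local stand-ins for the real storage.Err* values, and the exact contents of the 400 list differ between versions:

package handlersketch

import (
    "errors"
    "net/http"
)

// Local stand-ins for the sentinel errors in the Prometheus storage package;
// defined here only so the sketch compiles on its own.
var (
    errOutOfOrderSample = errors.New("out of order sample")
    errOutOfBounds      = errors.New("out of bounds")
    errDuplicateSample  = errors.New("duplicate sample for timestamp")
    errTooOldSample     = errors.New("too old sample")
)

// writeStatus sketches the remote-write handler's error mapping: success is
// 204, a known set of bad-client-data errors becomes 400, and anything the
// handler does not recognise (for example "label name ... is not unique")
// falls through to 500. Treating "too old sample" as a 400 appears to be
// what changed for us between v2.47 and v2.51.
func writeStatus(err error) int {
    switch {
    case err == nil:
        return http.StatusNoContent
    case errors.Is(err, errOutOfOrderSample),
        errors.Is(err, errOutOfBounds),
        errors.Is(err, errDuplicateSample),
        errors.Is(err, errTooOldSample):
        return http.StatusBadRequest
    default:
        return http.StatusInternalServerError
    }
}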

However, we encountered another 500:
prometheus agent log:
ts=2024-05-11T08:42:01.235Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.749Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"service\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.798Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"resourceType\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.851Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"namespace\" is not unique: invalid sample"

We modified the code to log the samples, and then we got the following Prometheus server log:
ts=2024-05-11T08:42:26.603Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"resourceId\" is not unique: invalid sample" series="{__name__=\"ovs_vswitchd_interface_resets_total\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-rbac-proxy\", endpoint=\"ovs-metrics\", hostname=\"20230428-wangbo-dev16\", if_name=\"veth99fa6555\", instance=\"10.253.58.238:9983\", job=\"net-monitor-vnet-ovs\", namespace=\"net-monitor\", pod=\"net-monitor-vnet-ovs-66bdz\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceId=\"port-naqoi5tmkg5lrt0ubw\", resourceId=\"blb-74se39mqa9k3\", resourceType=\"Port\", resourceType=\"BLB\", rs_ip=\"10.0.0.3\", service=\"net-monitor-vnet-ovs\", service=\"net-monitor-vnet-ovs\", subnet_Id=\"snet-ztojflwrnd08xf5idw\", vip=\"11.4.2.64\", vpc_Id=\"vpc-6ss1uz29ctpfv0eqbj\", vpcid=\"11.4.2.64\"}" timestamp=1715349156000
ts=2024-05-11T08:42:26.603Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"resourceId\" is not unique: invalid sample"
ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"service\" is not unique: invalid sample" series="{__name__=\"rest_client_request_size_bytes_bucket\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", pod=\"kube-scheduler-20230428-wangbo-dev14\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" timestamp=1715349164522
ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"service\" is not unique: invalid sample"
ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"prometheus_replica\" is not unique: invalid sample" series="{__name__=\"workqueue_work_duration_seconds_sum\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", job=\"metrics\", name=\"ResourceSyncController\", namespace=\"service-ca-operator\", pod=\"service-ca-operator-645cfdbfb6-rjr4z\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" timestamp=1715349271085
ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"prometheus_replica\" is not unique: invalid sample"

Currently we don't know why there are duplicated labels. But when the server encounters duplicated labels, it returns 500, and the agent then keeps retrying, which means new samples cannot be handled.
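
For what it's worth, "is not unique" just means the same label name appears twice in a single series, which matches the series dumps above. A minimal illustration of that kind of check (a local stand-in, not the actual validation code in Prometheus):

package labelsketch

import "fmt"

// Label is a minimal stand-in for a Prometheus label pair.
type Label struct{ Name, Value string }

// checkUnique illustrates the validation behind
// "label name ... is not unique: invalid sample": remote-write label sets
// are sorted by name, so a duplicated name shows up as two adjacent entries.
func checkUnique(ls []Label) error {
    for i := 1; i < len(ls); i++ {
        if ls[i].Name == ls[i-1].Name {
            return fmt.Errorf("label name %q is not unique: invalid sample", ls[i].Name)
        }
    }
    return nil
}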

We set external_labels in the Prometheus agent config:
global:
  evaluation_interval: 30s
  scrape_interval: 5m
  scrape_timeout: 1m
  external_labels:
    clusterName: clustertest150
    clusterRegion: region0
    clusterZone: zone1
    prometheus: ccos-monitoring/agent-0
    prometheus_replica: prometheus-agent-0-0
  keep_dropped_targets: 1

and the remote write config:
remote_write:
- url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
  remote_timeout: 30s
  name: prometheus-k8s-0
  write_relabel_configs:
  - target_label: __tmp_cluster_id__
    replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
    action: replace
  - regex: __tmp_cluster_id__
    action: labeldrop
  bearer_token: XDFSDF...
  tls_config:
    insecure_skip_verify: true
  queue_config:
    capacity: 10000
    min_shards: 1
    max_shards: 500
    max_samples_per_send: 2000
    batch_send_deadline: 10s
    min_backoff: 30ms
    max_backoff: 5s
    sample_age_limit: 5m

> You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.
Yes, retrying is the normal solution, but there should be a maximum number of retries. We noticed that the Prometheus agent puts the retry count into a request header, but it seems that header is not used by the server.

Besides, if some samples in a request are incorrect and others are correct, why doesn't the Prometheus server save the correct part and drop the wrong part? It is more complicated because retries have to be considered, but would it be possible to save the partial data and return a 206 once a maximum number of retries is reached?

And should the Prometheus server log the offending samples for all kinds of errors?

Brian Candler

May 20, 2024, 1:37:51 AM
to Prometheus Users
> server returned HTTP status 500 Internal Server Error: too old sample

This is not the server failing to process the data; it's the client supplying invalid data. You found that this has since been changed to return a 400.

> server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample

I can't speak for the authors, but it looks to me like that should be a 400 as well.

koly li

May 20, 2024, 4:25:47 AM
to Prometheus Users
Thanks a lot, Brian. I think I should create an issue.