Sorry for my poor description. Here is the story:
1) At first, we were using Prometheus v2.47.
Then we found that all metrics were missing, so we checked the Prometheus log and the Prometheus agent log.
Prometheus log (lots of similar lines):

```
ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
....
ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
```
Prometheus agent log:

```
ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
....
ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
```
Then we checked the code: a "too old sample" error is treated as a 500, and the agent keeps retrying (it only gives up when the error is not recoverable, and a 500 is considered recoverable).
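Our understanding of the agent's send loop, written out as a minimal sketch (the type and function names below are ours, not the actual Prometheus source): it only stops retrying on an unrecoverable error, so a batch that the server always rejects with 500 is retried forever and blocks everything queued behind it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// recoverableError marks errors (such as an HTTP 5xx response) that the
// sender is allowed to retry. The type name is ours, for illustration only.
type recoverableError struct{ error }

// sendBatch stands in for one remote-write request; here the server always
// answers 500, which the client wraps as a recoverable error.
func sendBatch(batch []string) error {
	return recoverableError{errors.New("server returned HTTP status 500 Internal Server Error: too old sample")}
}

// sendWithRetry only gives up when the error is NOT recoverable; a batch the
// server keeps rejecting with 500 therefore never leaves the queue.
func sendWithRetry(batch []string) error {
	for attempt := 1; ; attempt++ {
		err := sendBatch(batch)
		if err == nil {
			return nil
		}
		var rec recoverableError
		if !errors.As(err, &rec) {
			return err // unrecoverable (e.g. a 400): drop the batch and move on
		}
		fmt.Printf("attempt %d failed, retrying: %v\n", attempt, err)
		if attempt == 3 { // cut the demo short; the real loop has no such cap
			return err
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	_ = sendWithRetry([]string{"sample"})
}
```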
> You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.
We suspect that one or more samples with too-old timestamps caused the problem: the agent keeps receiving 500 for them and retries forever, which prevents any new samples from being sent.
> The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?
We restarted the agent 1~2 days after the problem occurred. By then the data being sent no longer contained too-old samples, which is why the 500 errors disappeared.
2) Then we upgraded to v2.51.
The new version returns 400 for "too old sample" errors.
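As far as we can tell, the difference is only in how the write handler maps the append error to an HTTP status code. A simplified sketch of that idea (our own code, not the actual write_handler.go):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinel errors standing in for the storage errors the server can hit.
// The variable names are ours, chosen only for the example.
var (
	errTooOldSample   = errors.New("too old sample")
	errDuplicateLabel = errors.New("label name is not unique: invalid sample")
)

// statusFor maps an append error to an HTTP status code. In the behaviour we
// observed, v2.47 answered 500 for too-old samples while v2.51 answers 400,
// so the agent stops retrying; duplicate-label errors still fall through to 500.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusNoContent
	case errors.Is(err, errTooOldSample):
		return http.StatusBadRequest // 400: the agent drops the batch
	default:
		return http.StatusInternalServerError // 500: the agent retries forever
	}
}

func main() {
	fmt.Println(statusFor(errTooOldSample))   // 400
	fmt.Println(statusFor(errDuplicateLabel)) // 500
}
```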
However, we encountered another 500:
Prometheus agent log:

```
ts=2024-05-11T08:42:01.235Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.749Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"service\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.798Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"resourceType\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.851Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"namespace\" is not unique: invalid sample"
```
We modified the code to log the offending samples, and got the following Prometheus log:
```
ts=2024-05-11T08:42:26.603Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"resourceId\" is not unique: invalid sample" series="{__name__=\"ovs_vswitchd_interface_resets_total\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-rbac-proxy\", endpoint=\"ovs-metrics\", hostname=\"20230428-wangbo-dev16\", if_name=\"veth99fa6555\", instance=\"10.253.58.238:9983\", job=\"net-monitor-vnet-ovs\", namespace=\"net-monitor\", pod=\"net-monitor-vnet-ovs-66bdz\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceId=\"port-naqoi5tmkg5lrt0ubw\", resourceId=\"blb-74se39mqa9k3\", resourceType=\"Port\", resourceType=\"BLB\", rs_ip=\"10.0.0.3\", service=\"net-monitor-vnet-ovs\", service=\"net-monitor-vnet-ovs\", subnet_Id=\"snet-ztojflwrnd08xf5idw\", vip=\"11.4.2.64\", vpc_Id=\"vpc-6ss1uz29ctpfv0eqbj\", vpcid=\"11.4.2.64\"}" timestamp=1715349156000
ts=2024-05-11T08:42:26.603Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"resourceId\" is not unique: invalid sample"
ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"service\" is not unique: invalid sample" series="{__name__=\"rest_client_request_size_bytes_bucket\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", pod=\"kube-scheduler-20230428-wangbo-dev14\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" timestamp=1715349164522
ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"service\" is not unique: invalid sample"
ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"prometheus_replica\" is not unique: invalid sample" series="{__name__=\"workqueue_work_duration_seconds_sum\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", job=\"metrics\", name=\"ResourceSyncController\", namespace=\"service-ca-operator\", pod=\"service-ca-operator-645cfdbfb6-rjr4z\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" timestamp=1715349271085
ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"prometheus_replica\" is not unique: invalid sample"
```
Currently we don't know why there are duplicated labels. But when the server encounters duplicated labels it returns 500, and the agent then keeps retrying, which means new samples cannot be handled.
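To make the error concrete: the series in the logs above literally carry the same label name twice (resourceId, resourceType, service, prometheus_replica, ...). A minimal check in the spirit of the one that produces the "is not unique" message (our own sketch, not the Prometheus code) would be:

```go
package main

import (
	"fmt"
	"sort"
)

type label struct{ Name, Value string }

// firstDuplicateName returns the first label name that appears more than once.
// After sorting by name, any duplicates sit next to each other.
func firstDuplicateName(ls []label) (string, bool) {
	sort.Slice(ls, func(i, j int) bool { return ls[i].Name < ls[j].Name })
	for i := 1; i < len(ls); i++ {
		if ls[i].Name == ls[i-1].Name {
			return ls[i].Name, true
		}
	}
	return "", false
}

func main() {
	// A cut-down version of one of the offending series from our logs.
	series := []label{
		{"__name__", "ovs_vswitchd_interface_resets_total"},
		{"resourceId", "port-naqoi5tmkg5lrt0ubw"},
		{"resourceId", "blb-74se39mqa9k3"},
		{"resourceType", "Port"},
		{"resourceType", "BLB"},
	}
	if name, ok := firstDuplicateName(series); ok {
		fmt.Printf("label name %q is not unique: invalid sample\n", name)
	}
}
```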
We set external_labels in the prometheus-agent config:

```yaml
global:
  evaluation_interval: 30s
  scrape_interval: 5m
  scrape_timeout: 1m
  external_labels:
    clusterName: clustertest150
    clusterRegion: region0
    clusterZone: zone1
    prometheus: ccos-monitoring/agent-0
    prometheus_replica: prometheus-agent-0-0
  keep_dropped_targets: 1
```
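As far as we understand, an external label is only supposed to be attached when the series does not already carry a label with that name, which is why the duplicated prometheus / prometheus_replica labels surprise us. A rough sketch of that kind of merge (our own illustration, not the Prometheus implementation):

```go
package main

import "fmt"

type label struct{ Name, Value string }

// applyExternalLabels adds each external label only if the series does not
// already have a label with the same name, so no name should ever repeat.
func applyExternalLabels(series, external []label) []label {
	seen := make(map[string]bool, len(series))
	for _, l := range series {
		seen[l.Name] = true
	}
	out := append([]label(nil), series...)
	for _, l := range external {
		if !seen[l.Name] {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	series := []label{{"__name__", "up"}, {"prometheus", "monitoring/agent-0"}}
	external := []label{{"prometheus", "ccos-monitoring/agent-0"}, {"clusterName", "clustertest150"}}
	// The series' own "prometheus" label is kept; "clusterName" is added once.
	fmt.Println(applyExternalLabels(series, external))
}
```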
And the remote_write config:

```yaml
remote_write:
- url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
  remote_timeout: 30s
  name: prometheus-k8s-0
  write_relabel_configs:
  - target_label: __tmp_cluster_id__
    replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
    action: replace
  - regex: __tmp_cluster_id__
    action: labeldrop
  bearer_token: XDFSDF...
  tls_config:
    insecure_skip_verify: true
  queue_config:
    capacity: 10000
    min_shards: 1
    max_shards: 500
    max_samples_per_send: 2000
    batch_send_deadline: 10s
    min_backoff: 30ms
    max_backoff: 5s
    sample_age_limit: 5m
```
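Note that queue_config only bounds the delay between attempts (min_backoff / max_backoff) and how old a sample may get before it is dropped (sample_age_limit); nothing here caps the number of retries. A small illustration of how that backoff schedule behaves with our settings (our own code, not the Prometheus implementation):

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule doubles the delay after each failed attempt, clamped to
// maxBackoff. Nothing in the schedule itself ever stops the retries.
func backoffSchedule(minBackoff, maxBackoff time.Duration, attempts int) []time.Duration {
	delays := make([]time.Duration, 0, attempts)
	d := minBackoff
	for i := 0; i < attempts; i++ {
		delays = append(delays, d)
		if d *= 2; d > maxBackoff {
			d = maxBackoff
		}
	}
	return delays
}

func main() {
	// With min_backoff: 30ms and max_backoff: 5s the delay is pinned at 5s
	// after the first eight attempts or so, and the agent then retries at
	// that pace indefinitely while the server keeps answering 500.
	for i, d := range backoffSchedule(30*time.Millisecond, 5*time.Second, 10) {
		fmt.Printf("attempt %2d: wait %s\n", i+1, d)
	}
}
```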
> You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.
Yes, retrying is normally the right behaviour, but there should be a maximum number of retries. We noticed that the prometheus agent sets the retry count in a request header, but it seems that header is not used by the server.
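What we have in mind is something like the following, purely hypothetical, server-side handling. The header name below is invented for the example; we have not verified which header the agent really sets or whether the server reads it:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

// maxRetries is the cap we would like enforced somewhere (server or agent).
const maxRetries = 10

// writeHandler is a hypothetical sketch, not the real remote-write handler.
// "X-Retry-Attempt" is an invented header name used only to illustrate the idea.
func writeHandler(w http.ResponseWriter, r *http.Request) {
	// appendErr stands in for an append that keeps failing for this batch.
	appendErr := fmt.Errorf("label name %q is not unique: invalid sample", "service")

	attempt, _ := strconv.Atoi(r.Header.Get("X-Retry-Attempt"))
	if attempt >= maxRetries {
		// Tell the client to stop retrying instead of looping on 500 forever.
		http.Error(w, appendErr.Error(), http.StatusBadRequest)
		return
	}
	http.Error(w, appendErr.Error(), http.StatusInternalServerError)
}

func main() {
	http.HandleFunc("/api/v1/write", writeHandler)
	_ = http.ListenAndServe(":9091", nil)
}
```

Whether such a cap lives in the server or in the agent's queue matters less to us than having some bound at all.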
Besides, if some samples in a request are invalid while others are valid, why doesn't the prometheus server save the valid part and drop the invalid part? It is more complicated because retries have to be considered, but would it be possible to save the partial data and return a 206 once the maximum number of retries is reached?
And should the prometheus server log the offending samples for all kinds of errors?