Prometheus HA different metrics


Анастасия Зель

Sep 4, 2023, 10:49:25 AM
to Prometheus Users

Hello, we use HA Prometheus with two servers.
The problem is that we get different metrics in the dashboards from these two servers.
We also scrape metrics from k8s, and some pods are not being scraped because of the error "context deadline exceeded".
It's different pods on each server. In the Prometheus logs we don't see any errors. How is that possible? What can we do to debug this?

prometheus, version 2.40.7 (branch: HEAD, revision: ab239ac5d43f6c1068f0d05283a0544576aaecf8) build user: root@afba4a8bd7cc build date: 20221214-08:49:43 go version: go1.19.4 platform: linux/amd64

prometheus config file
# This file is managed by ansible. Please don't edit it by hand or your changes would be overwritten.
#
# http://prometheus.io/docs/operating/configuration/

global:
  evaluation_interval: 30s
  scrape_interval: 30s
  scrape_timeout: 15s

  external_labels:
    null

rule_files:
  - /etc/prometheus/rules/*.rules

scrape_configs:
  - job_name: 'k8s_pods'
    scrape_interval: 5m
    scrape_timeout: 1m
    kubernetes_sd_configs:
      - role: pod
        api_server: https://x.x.x.x:6443
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: "/etc/prometheus/kubernetes_bearer_token"
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: (.+):(?:\d+);(\d+)
        replacement: ${1}:${2}
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: kubernetes_pod_node_name

Brian Candler

Sep 4, 2023, 11:00:32 AM
to Prometheus Users
On Monday, 4 September 2023 at 15:49:25 UTC+1 Анастасия Зель wrote:

Hello, we use HA prometheus with two servers.

You mean, two Prometheus servers with the same config, both scraping the same targets?

 

The problem is that we get different metrics in the dashboards from these two servers.

Small differences are to be expected.  That's because the two servers won't be scraping the targets at the same points in time.  If you see more significant differences, then please provide some examples.

 

We also scrape metrics from k8s, and some pods are not being scraped because of the error "context deadline exceeded".

That basically means "scrape timed out".  The scrape hadn't completed within the "scrape_timeout:" value that you've set.  You'll need to look at your individual exporters and the failing scrape URLs: either the target is not reachable at all (e.g. firewalling or network configuration issue), or the target is taking too long to respond.
 

It's different pods on each server. In the Prometheus logs we don't see any errors.

Where *do* you see the "context deadline exceeded" errors then?

Ben Kochie

Sep 4, 2023, 11:06:33 AM
to Brian Candler, Prometheus Users
Usually on the `/targets` page.

Prometheus does not log scrape errors by default. I would love this to be a configuration option, or even better, a per-job `scrape_configs` option.
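
For example, the same information is exposed by the HTTP API, so something along these lines can list the failing targets and their last error; it assumes Prometheus listens on localhost:9090 and that jq is installed:

# List down targets with their scrape URL and last error
# (assumes Prometheus on localhost:9090 and jq installed).
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[]
           | select(.health == "down")
           | "\(.scrapeUrl)\t\(.lastError)"'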
 


Анастасия Зель

Sep 5, 2023, 1:09:55 AM
to Prometheus Users
Yes, I see the errors on the Targets page in the web interface.
I tried increasing the timeout to 5 minutes and it changed nothing.
It's strange, because Prometheus 2 always gets this error on certain pods, and Prometheus 1 never gets these errors on those pods.
On Monday, 4 September 2023 at 19:00:32 UTC+4, Brian Candler wrote:

Brian Candler

Sep 5, 2023, 2:32:15 AM
to Prometheus Users
Note that setting the scrape timeout longer than the scrape interval won't achieve anything.

I'd suggest you investigate by looking at the history of the "up" metric: this will go to zero on scrape failures.  Can you discern a pattern?  Is it only on a certain type of target, or targets running on a particular k8s node?  Is it intermittent across all targets, or some targets which fail 100% of the time?
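
For example, a query along these lines against the HTTP API counts the currently failing k8s pod targets per node; the job name and the kubernetes_pod_node_name label come from the config posted above, and the hostname is a placeholder:

# Count failing 'k8s_pods' targets per k8s node.
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count by (kubernetes_pod_node_name) (up{job="k8s_pods"} == 0)'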

If you compare the Targets page on both servers, are they scraping exactly the same URLs?  (That is, check whether service discovery is giving different results)
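
One way to make that comparison without clicking through the UI is to dump the discovered scrape URLs from each server and diff them; prom1 and prom2 below are placeholder hostnames:

# Dump and compare the scrape URLs each server has discovered.
curl -s http://prom1:9090/api/v1/targets | jq -r '.data.activeTargets[].scrapeUrl' | sort > prom1-targets.txt
curl -s http://prom2:9090/api/v1/targets | jq -r '.data.activeTargets[].scrapeUrl' | sort > prom2-targets.txt
diff prom1-targets.txt prom2-targets.txt

If the lists differ, the problem is in service discovery rather than in the scrapes themselves.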

Анастасия Зель

Sep 5, 2023, 6:45:04 AM
to Prometheus Users
Actually, the targets are on different k8s nodes, but they fail 100% of the time on the Prometheus where they are down.
I got a list of all the down pod targets and noticed that the number of down pods is the same on both Prometheus nodes - 306 down pod targets. But they are different targets :D
Yes, they scrape the same pod URLs.
On Tuesday, 5 September 2023 at 10:32:15 UTC+4, Brian Candler wrote:

Brian Candler

Sep 5, 2023, 7:06:30 AM
to Prometheus Users
> they fail 100% of the time on the Prometheus where they are down

Then you're lucky: in principle it's straightforward to debug.
- get a shell on the affected prometheus server
- use "curl" to do a manual scrape of the target which is down (using the same URL that the Targets list shows)
- if it fails, then you've taken Prometheus out of the equation.

My best guesses would be (1) Network connectivity between the Prometheus server and the affected pods, or (2) service discovery is giving wrong information (i.e. you're scraping the wrong URL in the first place)
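
For guess (1), the manual scrape from the affected Prometheus server might look like this; the IP, port and path are placeholders and should be copied from the failing entry on the Targets page:

# Manually scrape one failing target from the affected server.
# 10.244.3.17, 8080 and /metrics are placeholders - use the real scrape URL.
curl -sv --max-time 60 http://10.244.3.17:8080/metrics > /dev/null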

In case (2), I note that you're getting the targets to scrape from pod annotations. Look carefully at the values of those annotations, and how they are mapped into scrape address/port/path for the affected pods.
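
For example, a quick way to see those annotations for one of the failing pods (namespace and pod name below are placeholders):

# Show the prometheus.io/* annotations of an affected pod.
kubectl -n my-namespace get pod my-failing-pod \
  -o jsonpath='{.metadata.annotations}{"\n"}'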

Анастасия Зель

Sep 5, 2023, 9:26:07 AM
to Prometheus Users
Yeah, I think scraping manually would be useful, but remember that these are k8s pods :)
I only have the pod IP, and I can't reach it from the Prometheus node because they are in different subnets. The pods' subnet doesn't have access to the outside network.
So I don't know how I can manually scrape a particular pod target from the Prometheus server.

But thank you for your guesses, I will check them out.
On Tuesday, 5 September 2023 at 15:06:30 UTC+4, Brian Candler wrote:

Stuart Clark

Sep 5, 2023, 9:41:31 AM
to Анастасия Зель, Prometheus Users
On 2023-09-05 14:26, Анастасия Зель wrote:
> Yeah, I think scraping manually would be useful, but remember that
> these are k8s pods :)
> I only have the pod IP, and I can't reach it from the Prometheus node
> because they are in different subnets. The pods' subnet doesn't have
> access to the outside network.
> So I don't know how I can manually scrape a particular pod target
> from the Prometheus server.
>

That would explain why it isn't working. You need to have network
connectivity to all of your scrape targets from the Prometheus server.
So if you have configured Prometheus to scrape every pod (via the
Kubernetes SD for example) the Prometheus server will either need to be
inside the cluster or connected to the same network mechanism as the
pods.
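
If direct access from the Prometheus host is not possible, one way to at
least confirm that the exporter itself responds is to run the scrape from
a throwaway pod inside the cluster; the namespace, pod IP and port below
are only examples:

# Run a one-off curl from inside the cluster to test the exporter itself.
# Namespace, pod IP and port are placeholders.
kubectl -n my-namespace run curl-test --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sv --max-time 60 http://10.244.3.17:8080/metrics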

--
Stuart Clark

Brian Candler

Sep 5, 2023, 10:08:31 AM
to Prometheus Users
On Tuesday, 5 September 2023 at 14:26:07 UTC+1 Анастасия Зель wrote:
I only have the pod IP, and I can't reach it from the Prometheus node because they are in different subnets.

Hosts on different subnets *could* talk to each other - that's what routers are for.

It's quite possible that you have a routing or network reachability issue, but you'll have to work out why you can reach some pods but not others.  That will be down to how your particular k8s cluster(s) have been built and configured.
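
For example, checking what route (if any) the Prometheus host would use towards one of the failing pod IPs (the IP is a placeholder):

# Show the route the Prometheus host would use to reach a failing pod IP.
ip route get 10.244.3.17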