Alerts firing right after setting up kube-prometheus-stack


M T

Dec 10, 2021, 6:40:10 PM
to Prometheus Users
Hi all,

I'm just starting out with prometheus & family, and am trying to deploy it to my local development kubernetes cluster to play around with it a bit. I'm using the kube-prometheus-stack helm chart for this.

I've installed it with the following values:

namespaceOverride: "monitoring"
commonLabels:
  infrastructure: kube-prometheus-stack

grafana:
  namespaceOverride: "monitoring"

kube-state-metrics:
  namespaceOverride: "monitoring"

prometheus-node-exporter:
  namespaceOverride: "monitoring"
  hostRootFsMount: false  # necessary to avoid crashes for node-exporter on docker desktop
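
(For reference, the chart lives in the prometheus-community Helm repo, so the install is assumed to have been along these lines - release name and values file name are just placeholders:)

  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm install my-monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    -f values.yaml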

As you can see from the hostRootFsMount comment in my configuration, I'm using Docker Desktop (for Windows).

When accessing the prometheus dashboard, I see the following alerts firing:
* TargetDown for: kube-proxy, kube-scheduler, kube-controller-manager, kube-etcd
* etcdInsufficientMembers
* KubeControllerManagerDown
* KubeSchedulerDown

Notably, the targets listed in TargetDown are all in the kube-system namespace. I don't really know why they are marked as down: I can see pods created by Docker Desktop that match them (the pod names have the suffix -docker-desktop), so the pods are definitely there - I suspect that somehow the services can't connect to them, and that the other 3 alerts are related to those targets being down.

Can anyone help me and provide some guidance and what I need to do to solve these alerts? That would be greatly appreciated!

Brian Candler

Dec 11, 2021, 4:07:45 AM
to Prometheus Users
kube-prometheus-stack is not part of prometheus.  Therefore, your best chance of getting help is on a mailing list or tracker for that project.

From a prometheus point of view, all I can say is that you are scraping some targets and they are down.  You need to look at the prometheus scraping config and related service discovery to see what it's trying to scrape.  This configuration was built for you by the helm chart, so if it's wrong, the helm chart is wrong.
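
For example, one quick way to see the full list of targets and the exact scrape error is to query the Prometheus API directly - assuming the operator's default prometheus-operated service in the "monitoring" namespace and jq installed locally:

  kubectl -n monitoring port-forward svc/prometheus-operated 9090 &
  curl -s http://localhost:9090/api/v1/targets \
    | jq '.data.activeTargets[] | {job: .labels.job, url: .scrapeUrl, health: .health, error: .lastError}'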

M T

Dec 11, 2021, 10:27:24 AM
to Brian Candler, Prometheus Users
The GitHub repo maintaining this helm chart is part of the prometheus-community organization on GitHub, and it seems they only want bugs & feature requests via GitHub issues. So I followed the links from prometheus-community to find some other forum for usage help, assuming - based on the repo structure on GitHub - that it would also cover the helm charts. If this is not the correct mailing list, can you maybe point me to where else I should ask for help?


Brian Candler

Dec 11, 2021, 1:21:50 PM
to Prometheus Users
Are you referring to this one?

It doesn't seem to give any discussion forum link, but it does say what it installs, and I see there is a "Discussions" tab there. That might be somewhere to ask.

M T

Dec 12, 2021, 9:45:09 AM
to Prometheus Users
Well, this turned out to be a very educative debugging session - I'm posting my findings here in case someone else stumbles upon it :)
Only after having configured my first scraping job (ServiceMonitor Kubernetes CR) was I able to learn enough to actually debug stuff myself, and it not only taught me about prometheus configuration, but also about cluster management in Kubernetes, which - up to now - I had completely delegated to Docker Desktop!

Summary: kubeadm is the tool that Docker Desktop uses to create the cluster, and kubeadm has made a couple of changes to its default configuration that clash with the helm chart - so depending on which kubeadm version your local cluster was created with, you will see different behavior. Some of that can be solved by configuring Prometheus correctly, but other things actually require re-configuring your local cluster.

Disclaimer: this solution is geared towards the Kubernetes cluster offered by Docker Desktop for Windows and a Prometheus deployed via the prometheus-community/kube-prometheus-stack helm chart, but I'll try to point out how to make it more general. Even though I found GitHub issues describing exactly my problem quite early on, I did not recognize that until I had gained enough knowledge about cluster management and Prometheus configuration to understand what they were about. If you already have that knowledge, then https://github.com/prometheus-operator/kube-prometheus/issues/718 contains all you need. If that issue is confusing, here's my explanation in the form I wish I had encountered, so I wouldn't have needed to collect the required knowledge through other means :)

  • In general, there is a mismatch between the endpoints providing the metrics, and what Prometheus expects these endpoints to look like. The different failure modes I encountered while debugging these issues:
    • port mismatch
    • HTTP vs HTTPS
    • self-signed certificates failing verification
    • port binding to localhost, preventing Prometheus from scraping the endpoint even if all of the above is correct (since I'm also quite new to networking, I had no idea that something like this existed... hence my confusion about that GitHub issue!)
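
Each of these failure modes can be reproduced by hand from inside the cluster, which takes a lot of the guesswork out of the error messages. A rough sketch, assuming a throwaway curl pod and the ports that are typical for a recent kubeadm-based cluster (substitute whatever you find in the steps below):

  # throwaway pod with a shell and curl (the image is just a common choice)
  kubectl run -it --rm curl-debug --image=curlimages/curl --restart=Never --command -- sh
  # inside that pod, try the address + port that Prometheus shows for the target,
  # e.g. for kube-scheduler (node IP as shown by 'kubectl get nodes -o wide'):
  curl -v http://<node IP>:10251/metrics    # old insecure HTTP port - refused if the component no longer serves it
  curl -v https://<node IP>:10259/metrics   # secure port - a certificate error here is the self-signed cert
  curl -vk https://<node IP>:10259/metrics  # -k skips certificate verification, like insecureSkipVerify further below
  # "connection refused" on every port usually means the component is bound to 127.0.0.1 on the node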

How to figure out each of the failure modes:

  1. Look at the Prometheus targets (easiest via the web UI, go to status -> targets) - this way, you'll see the error message that Prometheus encountered while trying to scrape, as well as the address + port it was attempting to scrape
    • connection refused: indicator of either a port mismatch, a port binding issue, or requesting HTTP from an HTTPS-only endpoint
    • certificate could not be verified: self-signed certificate
    • certificate for the wrong server: some other certificate issue
  2. Compare the things you see there with how the pods are configured:
    1. See the list of pods by calling kubectl -n kube-system get pod
    2. See the details of a specific pod by calling kubectl -n kube-system describe pod <podname> - in the case of Docker Desktop, these pod names should be:
      • kube-scheduler -> kube-scheduler-docker-desktop
      • kube-controller-manager -> kube-controller-manager-docker-desktop
      • etcd -> etcd-docker-desktop
      • kube-proxy -> kube-proxy-<some random string>
    3. The interesting bits necessary to figure out the correct metrics endpoint configuration are:
      • If some liveness or startup probes are present, check which port they are using -> the metrics endpoint will use the same port
      • In the command part, check if there's a command line flag for port binding with values like the ones below - if yes, Prometheus will always be refused, because it is not running on the same host as the component:
        • kube-scheduler -> bind-address=127.0.0.1
        • kube-controller-manager -> bind-address=127.0.0.1
        • etcd -> listen-metrics-urls=http://127.0.0.1:<port>
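
A quick way to pull out exactly those bits without reading the whole pod description (using the Docker Desktop pod names from above):

  # prints the relevant command line flags and probe ports; repeat for the other pods
  kubectl -n kube-system get pod kube-scheduler-docker-desktop -o yaml \
    | grep -E 'bind-address|listen-metrics-urls|port:'
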
Now, for fixing the different things! Following the order here makes fixing easier (a combined helm values example for the Prometheus-side fixes follows right after this list).
  • Wrong port number queried by Prometheus: Fixable by configuring Prometheus!
    • create a yaml containing your custom values for the helm chart, if you don't already have one
    • Add the following values to it:
      (kubeScheduler|kubeControllerManager|kubeEtcd|kubeProxy):  # Adapt for those that need adapting
        service:
          targetPort: <correct port>
  • Port binding refusing Prometheus (one of the flags mentioned above is present): You need to change your cluster configuration by editing files created by Docker Desktop... :( And Docker Desktop really tries to hide how to do it! Kudos to https://stackoverflow.com/a/68971526 for showing how. NOTE: every cluster reset will overwrite your changes there!
    • Figure out where the manifests are that control these settings - on Windows, it's here: \\wsl$\docker-desktop-data\version-pack-data\community\kubeadm\manifests
    • Edit these manifests as required, depending on which settings are active
      • kube-scheduler -> bind-address=127.0.0.1 change to bind-address=0.0.0.0  (only do this in production if that port is not accessible from outside of the cluster!)
      • kube-controller-manager -> bind-address=127.0.0.1 change to bind-address=0.0.0.0  (only do this in production if that port is not accessible from outside of the cluster!)
      • etcd -> listen-metrics-urls=http://127.0.0.1:<port> change to listen-metrics-urls=http://127.0.0.1:<port>,http://<cluster IP>:2381 - you can find the cluster IP in the other settings in that same file
    • Run kubectl -n kube-system get pod again and check the age of those pods - if they are older than your edits on those manifests, manually kill them by calling kubectl -n kube-system delete pod <pod name> (don't worry, Kubernetes will restart them for you, and upon restart, it will use the new version of the manifests!)
  • Prometheus still complains about "connection refused", even after fixing these port bindings: You need to change your cluster configuration again, but this time, it's enough to edit some ConfigMaps in your cluster at runtime. This step needs to happen after editing the Docker Desktop files, as any command line argument provided by Docker Desktop would override the changes in config files that are done here.
    • kubectl -n kube-system get configmap to get a list of configurations
    • kube-scheduler, kube-controller-manager and etcd read their configuration from kubeadm-config (if there is no such ConfigMap, check in the pod description again to see which path they use to mount the configuration file, and in the volumes section, what the name is that they expect for it).
    • kube-proxy reads its configuration from kube-proxy (again, check the pod configuration to see if the name might be different if you can't find it)
    • Check again that the pods have been restarted after that change so they pick it up (delete them manually as above if they haven't)
  • Prometheus still complains about "connection refused": since both the port & the port binding are now correct, the only thing left is the wrong protocol - Prometheus making plain HTTP requests while the endpoint requires HTTPS. This can be fixed by configuring Prometheus!
    • create a yaml containing your custom values for the helm chart, if you don't already have one
    • Add the following values to it:
      (kubeScheduler|kubeControllerManager|kubeEtcd|kubeProxy):  # Adapt for those that need adapting
        serviceMonitor:
          https: true
  • Prometheus complains about some certificate issue or another: This can be fixed by configuring Prometheus!
    • create a yaml containing your custom values for the helm chart, if you don't already have one
    • Add the following values to it:
      (kubeScheduler|kubeControllerManager|kubeEtcd|kubeProxy):  # Adapt for those that need adapting
        serviceMonitor:
          insecureSkipVerify: true
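
Putting the Prometheus-side fixes together, the combined helm values end up looking roughly like this - the ports shown are the usual defaults on a recent kubeadm-based cluster, so cross-check them against what you found in the pod descriptions above:

  kubeControllerManager:
    service:
      targetPort: 10257          # secure port; older clusters served plain HTTP on 10252
    serviceMonitor:
      https: true
      insecureSkipVerify: true   # the component uses a self-signed certificate
  kubeScheduler:
    service:
      targetPort: 10259          # secure port; older clusters served plain HTTP on 10251
    serviceMonitor:
      https: true
      insecureSkipVerify: true
  kubeEtcd:
    service:
      targetPort: 2381           # etcd's plain-HTTP metrics port
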
If you use some other way to deploy Prometheus instead of that helm chart, you'll need to check how your deployment mechanism configures these system scrape jobs.
If you're using Docker Desktop on some other OS than Windows, you'll need to figure out where Docker Desktop puts the manifests for the system pods.
If you're using something other than Docker Desktop, you'll need to check that tool's documentation - chances are it's easier to configure what kubeadm is doing there than it is with Docker Desktop :) (a sketch of the relevant kubeadm settings follows below)
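
For reference, on a cluster where you control kubeadm yourself, the same changes can be declared up front in the ClusterConfiguration instead of editing generated manifests - a sketch I could not test on Docker Desktop, and the same production caveat about exposing these ports applies:

  apiVersion: kubeadm.k8s.io/v1beta3
  kind: ClusterConfiguration
  controllerManager:
    extraArgs:
      bind-address: "0.0.0.0"
  scheduler:
    extraArgs:
      bind-address: "0.0.0.0"
  etcd:
    local:
      extraArgs:
        listen-metrics-urls: "http://0.0.0.0:2381"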

These were the steps I had to take to make the alerts go away. I hope this helps someone, should they stumble upon it :) And I also hope it contains enough explanation for other newbies like me to not just blindly follow the steps, but actually understand why they are necessary!

Happy monitoring :)


Brian Candler

Dec 12, 2021, 1:17:49 PM
to Prometheus Users
Awesome comprehensive write-up - thank you for taking the trouble.

Matthias Loibl

Dec 13, 2021, 6:27:26 AM
to Prometheus Users
Hey,

for the kube-prometheus project the best place to find support and answers would be the Discussions on GitHub:
https://github.com/prometheus-operator/kube-prometheus/discussions

It sounds like you found a lot of really good things! 
If there are things in there that you think should be generally applicable for the kube-prometheus stack, then please feel free to open a couple of GitHub issues, to make this easier for others in the future! :)