Prometheus job "kubernetes-nodes" endpoints in state "UNKNOWN"

Lijing Zhang

Jun 11, 2019, 12:13:32 PM
to Prometheus Users
Hi, 

We are facing an issue where some endpoints of the Prometheus job "kubernetes-nodes" are in the "UNKNOWN" state.

[Attachment: Image 5.png, screenshot of the "kubernetes-nodes" targets in state "UNKNOWN"]


Could you please advise on possible reasons for this?
Does it mean that something is wrong with this node (its name is hidden by the red block)?

Thanks in advance.

- job_name: kubernetes-nodes
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace


Chris Marchbanks

Jun 11, 2019, 11:18:04 PM
to Lijing Zhang, Prometheus Users
Hello,

The "UNKNOWN" state occurs before a scrape has been attempted against that target. In this case, it is likely that either that node has come up within the last scrape interval, that Prometheus has recently been restarted, or that the Prometheus scrape configuration recently changed.

Chris

Lijing Zhang

Jun 12, 2019, 10:59:26 PM
to Prometheus Users
Thank you @ChrisMarchbanks.

I know that before Prometheus does its first scrape an endpoint is in the "UNKNOWN" state, but things seem to be a little different here. Those nodes came up many days ago, and so did Prometheus.

We tried to curl those endpoints manually:
curl -H "Authorization: Bearer $(</var/run/secrets/kubernetes.io/serviceaccount/token)" https://kubernetes.default.svc:443/api/v1/nodes/<node_name>/proxy/metrics --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
The metrics can be fetched this way, but the endpoint is still "UNKNOWN" in Prometheus, and this has been the case for several days.

We don't know whether there are other criteria for the "UNKNOWN" state. Prometheus keeps considering these targets "UNKNOWN" and does not seem to be able to get metrics from them...


On Wednesday, June 12, 2019 at 11:18:04 AM UTC+8, Chris Marchbanks wrote:

Lijing Zhang

Jun 13, 2019, 3:27:51 AM
to Prometheus Users
What also puzzles me is that some nodes are in "UNKNOWN" while others are "UP". I have attached a new screenshot.

[Attachment: Image 11.png, screenshot showing some node targets "UNKNOWN" and others "UP"]



On Wednesday, June 12, 2019 at 11:18:04 AM UTC+8, Chris Marchbanks wrote:

Simon Pasquier

Jun 13, 2019, 4:43:26 AM
to Lijing Zhang, Prometheus Users
It could be that the associated Kubernetes resources are constantly being updated. Prometheus service discovery would generate new targets continuously and there wouldn't be enough time for the scraping to happen.


Lijing Zhang

Jun 20, 2019, 9:50:52 PM
to Prometheus Users
Thanks for the idea, Simon. Is there any way to verify that the Kubernetes resources are constantly being updated?

On Thursday, June 13, 2019 at 4:43:26 PM UTC+8, Simon Pasquier wrote:

Simon Pasquier

Jun 21, 2019, 3:58:03 AM
to Lijing Zhang, Prometheus Users
You can look at the following query:
rate(prometheus_target_sync_length_seconds_count[5m])

It will tell you how often every scrape configuration is synchronized.

You can also look at:
rate(prometheus_sd_kubernetes_events_total[5m])
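For example, that counter can be broken down by its event and role labels to see which resource role and event type dominate (a sketch; event and role are labels of this metric):

sum by (role, event) (rate(prometheus_sd_kubernetes_events_total[5m]))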


Lijing Zhang

Jun 24, 2019, 1:36:37 AM
to Prometheus Users
Thank you, Simon. I checked these expressions: the first one is between 0.2 and 0.217, and the second one is about 2.5.
So, can we conclude anything from these two graphs about why the endpoints are "UNKNOWN"?

(1) [Attachment: Image 8.png, graph of the first query]

(2) [Attachment: Image 9.png, graph of the second query]

On Friday, June 21, 2019 at 3:58:03 PM UTC+8, Simon Pasquier wrote:

Lijing Zhang

Jun 24, 2019, 4:07:36 AM
to Prometheus Users
My guess: at first the target was in the "UNKNOWN" state, then Prometheus started to scrape it (the target became "UP"), but then for some reason the target was updated, Prometheus treated it as a new target, there wasn't enough time for a scrape to happen, and so we always see it as "UNKNOWN"...

So, is there any log on the Kubernetes side that shows that the targets have been updated?

Sincere thanks...


On Friday, June 21, 2019 at 3:58:03 PM UTC+8, Simon Pasquier wrote:

Simon Pasquier

Jun 24, 2019, 9:01:43 AM
to Lijing Zhang, Prometheus Users
Yes it looks like your targets are constantly being updated which means that Prometheus recreates the targets on and on.
You can start Prometheus with --log.level=debug and it may provide more details.
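For example, a sketch of the command line, assuming the default configuration path; if Prometheus runs in a container, the same flag can be appended to the container's args:

prometheus --config.file=/etc/prometheus/prometheus.yml --log.level=debug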


Lijing Zhang

Jun 24, 2019, 9:46:23 PM
to Prometheus Users
Thanks for your analysis, Simon. Please forgive me for asking so many questions...

"Yes it looks like your targets are constantly being updated which means that Prometheus recreates the targets on and on."
Now that we've found that the targets are constantly being updated, how can we avoid (or fix) this? A fix on either the Kubernetes side or the Prometheus side would be fine...

Sincerely...

On Monday, June 24, 2019 at 9:01:43 PM UTC+8, Simon Pasquier wrote:

Lijing Zhang

Jun 24, 2019, 11:16:48 PM
to Prometheus Users
I am also attaching my log. I can find little about "kubernetes-nodes" and "nodes" except for the following lines:

level=debug ts=2019-06-24T05:59:55.457Z caller=scrape.go:923 component="scrape manager" scrape_pool=kubernetes-nodes target=https://kubernetes.default.svc:443/api/v1/nodes/shashirbac-worker-03/proxy/metrics msg="Scrape failed" err="Get https://kubernetes.default.svc:443/api/v1/nodes/shashirbac-worker-03/proxy/metrics: context deadline exceeded"
and 
{"log":"level=debug ts=2019-06-22T08:47:04.449Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg=\"github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:335: Watch close - *v1.Node total 622 items received\"\n","stream":"stderr","time":"2019-06-22T08:47:04.449674081Z"}
...
{"log":"level=debug ts=2019-06-22T08:55:09.451Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg=\"github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:335: Watch close - *v1.Node total 604 items received\"\n","stream":"stderr","time":"2019-06-22T08:55:09.45126979Z"}
...
{"log":"level=debug ts=2019-06-22T09:00:40.452Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg=\"github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:335: Watch close - *v1.Node total 412 items received\"\n","stream":"stderr","time":"2019-06-22T09:00:40.452885212Z"}
...



On Monday, June 24, 2019 at 9:01:43 PM UTC+8, Simon Pasquier wrote:
[Attachment: debug-log.txt]

Ming Li

Jun 25, 2019, 2:35:19 AM
to Prometheus Users

For the metric rate(prometheus_sd_received_updates_total[5m]), we observe that its value is about 2 or 3, i.e. roughly 2 to 3 service discovery updates are sent to the scrape manager every second.

Our question is why SD events are reported so frequently.



On Tuesday, June 25, 2019 at 9:46:23 AM UTC+8, Lijing Zhang wrote:

Simon Pasquier

Jun 25, 2019, 11:44:49 AM
to Ming Li, Prometheus Users

The Kubernetes SD gets the events from the Kubernetes API.
The scrapers are reloaded only if/when there is a change in the target labels.
 

Ming Li

Jun 25, 2019, 11:17:01 PM
to Prometheus Users
From the counter prometheus_sd_kubernetes_events_total we see the following:
     rate(prometheus_sd_kubernetes_events_total{event="update",role="node"}[5m])  1.2541666666666667

This implies that about 1.25 node update events are reported by Kubernetes every second.
To our understanding the nodes are not being modified, so there should be no update events.
Can you share information on what counts as a node update event and why these events are being reported?
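One way to check whether the Node objects themselves are being written to is to watch their resourceVersion. A small sketch, assuming kubectl access to the cluster (the node name is a placeholder):

while true; do
  kubectl get node <node_name> -o jsonpath='{.metadata.resourceVersion}'; echo
  sleep 10
done

If the resourceVersion keeps increasing while nobody is editing the node, something (for example the kubelet writing node status) is updating the object, and each write shows up as an update event on the Prometheus watch.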


Ming Li

Jun 26, 2019, 2:01:28 AM
to Prometheus Users
I suspect that the reload interval in the scrape manager might be the root cause of the UNKNOWN issue.

From scrape/manager.go:
func (m *Manager) reloader() {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()

        for {
                select {
                case <-m.graceShut:
                        return
                case <-ticker.C:
                        select {
                        case <-m.triggerReload:
                                m.reload()
                        case <-m.graceShut:
                                return
                        }
                }
        }
}

Currently the interval is hardcoded to 5 seconds, which might be too short for the scrape manager to finish updating all the targets. Could it be made configurable?


Simon Pasquier

Jul 3, 2019, 8:16:35 AM
to Ming Li, Prometheus Users
First you need to determine why the target labels change constantly.
Increasing the reload interval would only hide the issue temporarily.
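One way to see whether (and which) target labels are changing is to capture the discovered labels twice and diff them; a sketch assuming the Prometheus API is reachable at localhost:9090 and jq is available:

curl -s http://localhost:9090/api/v1/targets | jq '[.data.activeTargets[] | select(.labels.job=="kubernetes-nodes") | .discoveredLabels]' > /tmp/targets-1.json
sleep 30
curl -s http://localhost:9090/api/v1/targets | jq '[.data.activeTargets[] | select(.labels.job=="kubernetes-nodes") | .discoveredLabels]' > /tmp/targets-2.json
diff /tmp/targets-1.json /tmp/targets-2.json

Any diff output points at the labels that keep changing between service discovery updates.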

Ming Li

Jul 10, 2019, 3:49:10 AM
to Prometheus Users
Thanks for the advice. We have now found that the UNKNOWN issue is indeed triggered by the frequently changing target labels.
But I think that if scrapeLoop.run were enhanced along the lines discussed below, the issue could also be resolved.

func (sl *scrapeLoop) run(interval, timeout time.Duration, errc chan<- error) {
        select {
        case <-time.After(sl.scraper.offset(interval, sl.jitterSeed)):
                // Continue after a scraping offset.
        case <-sl.scrapeCtx.Done():
                close(sl.stopped)
                return
        }

        var last time.Time

        ticker := time.NewTicker(interval)
        defer ticker.Stop()

mainLoop:
        for {
                select {
                case <-sl.ctx.Done():
                        close(sl.stopped)
                        return
                case <-sl.scrapeCtx.Done():
                        break mainLoop
                default:
                }
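                // ... (the rest of the loop, which performs the actual scrape, is omitted here)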

The highlighted lines above are the root cause of the "UNKNOWN" state inside Prometheus.

I am not sure whether those lines can be removed so that, even when target labels change frequently, the node state can still be displayed.