Best practice find slow query

129 views

Skip to first unread message

limi...@gmail.com

unread,

May 17, 2019, 1:32:14 PM5/17/19

to Prometheus Users

Hello Prometheus experts,

We are starting to use Prometheus since last October and it's awesome and stable, thanks for creating this!

Since our software released to production this year, it's working perfect until recently.

The query time and CPU/Memory consumption increased but we didn't introduce new serious(other than some pod rotation), till now we can't find the query cause the problem.

The troubleshooting we have been done include the following:

1. We are using Prometheus-operator to deploy Prometheus, the latest version is compatible with v2.7.2, so we upgrade from v2.5.0 to v2.7.2.

The problem seems gone and the CPU/Memory back to normal for like a week then the problem back again.

2. Checked some Prometheus internal metrics, attached some key metrics below:

It did show the Mem/CPU increased with the query time

3. Open debug logs, attached some logging sample below.

Not identify key information from it, saw from a github issue that Prometheus not expose slow query logs yet.

level=debug ts=2019-05-17T16:54:42.168017996Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service total 0 items received"

level=debug ts=2019-05-17T16:54:42.170121992Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/monitoring/services?resourceVersion=24797369&timeout=8m40s&timeoutSeconds=520&watch=true 200 OK in 1 milliseconds"

level=debug ts=2019-05-17T16:54:52.269476837Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod total 0 items received"

level=debug ts=2019-05-17T16:54:52.32996453Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/monitoring/pods?resourceVersion=25142787&timeout=6m0s&timeoutSeconds=360&watch=true 200 OK in 59 milliseconds"

level=debug ts=2019-05-17T16:54:52.615343281Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service total 0 items received"

level=debug ts=2019-05-17T16:54:52.635172368Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/monitoring/services?resourceVersion=24797369&timeout=9m25s&timeoutSeconds=565&watch=true 200 OK in 19 milliseconds"

level=debug ts=2019-05-17T16:54:53.493438475Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod total 0 items received"

level=debug ts=2019-05-17T16:54:53.495189767Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/kube-system/pods?resourceVersion=25142787&timeout=6m50s&timeoutSeconds=410&watch=true 200 OK in 1 milliseconds"

level=debug ts=2019-05-17T16:54:58.626413142Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod total 0 items received"

level=debug ts=2019-05-17T16:54:58.628175101Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/default/pods?resourceVersion=25242284&timeout=7m17s&timeoutSeconds=437&watch=true 200 OK in 1 milliseconds"

level=error ts=2019-05-17T16:55:31.949151341Z caller=api.go:1099 component=web msg="error writing response" bytesWritten=0 err="write tcp 10.0.77.181:9090->10.0.98.133:50764: write: broken pipe"

level=error ts=2019-05-17T16:55:31.977217056Z caller=api.go:1099 component=web msg="error writing response" bytesWritten=0 err="write tcp 10.0.77.181:9090->10.0.98.133:50758: write: broken pipe"

level=error ts=2019-05-17T16:55:31.991923795Z caller=api.go:1099 component=web msg="error writing response" bytesWritten=0 err="write tcp 10.0.77.181:9090->10.0.98.133:50786: write: broken pipe"

level=debug ts=2019-05-17T16:55:54.827951493Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - *v1.Endpoints total 0 items received"

level=debug ts=2019-05-17T16:55:54.8373099Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/ingress-nginx/endpoints?resourceVersion=25278160&timeout=7m44s&timeoutSeconds=464&watch=true 200 OK in 9 milliseconds"

level=warn ts=2019-05-17T16:55:54.837493003Z caller=klog.go:86 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:300: watch of *v1.Endpoints ended with: too old resource version: 25278160 (25278726)"

level=debug ts=2019-05-17T16:55:55.837642377Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="Listing and watching *v1.Endpoints from /app/discovery/kubernetes/kubernetes.go:300"

level=debug ts=2019-05-17T16:55:55.839818229Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/ingress-nginx/endpoints?limit=500&resourceVersion=0 200 OK in 2 milliseconds"

level=debug ts=2019-05-17T16:55:55.841480223Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/ingress-nginx/endpoints?resourceVersion=25280464&timeout=7m54s&timeoutSeconds=474&watch=true 200 OK in 1 milliseconds"

level=debug ts=2019-05-17T16:56:39.672732588Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: forcing resync"

level=debug ts=2019-05-17T16:57:00.686226407Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - *v1.Endpoints total 0 items received"

level=debug ts=2019-05-17T16:57:00.687913369Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/monitoring/endpoints?resourceVersion=25278468&timeout=5m55s&timeoutSeconds=355&watch=true 200 OK in 1 milliseconds"

level=warn ts=2019-05-17T16:57:00.688149973Z caller=klog.go:86 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:300: watch of *v1.Endpoints ended with: too old resource version: 25278468 (25278836)"

level=debug ts=2019-05-17T16:57:01.688301933Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="Listing and watching *v1.Endpoints from /app/discovery/kubernetes/kubernetes.go:300"

level=debug ts=2019-05-17T16:57:01.690457942Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/monitoring/endpoints?limit=500&resourceVersion=0 200 OK in 2 milliseconds"

level=debug ts=2019-05-17T16:57:01.692469494Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/monitoring/endpoints?resourceVersion=25280581&timeout=6m28s&timeoutSeconds=388&watch=true 200 OK in 1 milliseconds"

level=debug ts=2019-05-17T16:57:35.017591915Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - *v1.Endpoints total 306 items received"

level=debug ts=2019-05-17T16:57:35.027129763Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/kube-system/endpoints?resourceVersion=25280638&timeout=7m39s&timeoutSeconds=459&watch=true 200 OK in 9 milliseconds"

level=debug ts=2019-05-17T16:58:01.203323354Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service total 0 items received"

level=debug ts=2019-05-17T16:58:01.213971777Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/ingress-nginx/services?resourceVersion=24797369&timeout=9m44s&timeoutSeconds=584&watch=true 200 OK in 10 milliseconds"

level=debug ts=2019-05-17T16:59:21.431595235Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod total 0 items received"

level=debug ts=2019-05-17T16:59:21.441100798Z caller=klog.go:70 component=k8s_client_runtime func=Infof msg="GET https://172.20.0.1:443/api/v1/namespaces/monitoring/pods?resourceVersion=25142787&timeout=8m25s&timeoutSeconds=505&watch=true 200 OK in 9 milliseconds"

level=error ts=2019-05-17T16:59:49.77703169Z caller=api.go:1099 component=web msg="error writing response" bytesWritten=0 err="write tcp 10.0.77.181:9090->10.0.98.133:51904: write: broken pipe"

level=info ts=2019-05-17T17:00:03.594506257Z caller=compact.go:443 component=tsdb msg="write block" mint=1558101600000 maxt=1558108800000 ulid=01DB3BRWSDRAZE0CA29N1HXFTC

level=info ts=2019-05-17T17:00:04.49353602Z caller=head.go:526 component=tsdb msg="head GC completed" duration=211.926469ms

level=info ts=2019-05-17T17:00:08.512417508Z caller=head.go:573 component=tsdb msg="WAL checkpoint complete" first=5642 last=5650 duration=4.018819936s

level=debug ts=2019-05-17T17:00:23.151252507Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - *v1.Endpoints total 0 items received"

4. Disabled all alerts and make sure no one access the dashboard, the heap size drops.(Dumb debugging way)

Feel little lost about how to further troubleshooting this issue, a dumb way is disable this chart one by one and see how it goes...Any advice or guidence is appriciated, thanks in advance!

limi...@gmail.com

unread,

May 23, 2019, 12:52:35 PM5/23/19

to Prometheus Users

We found the problem, it's due to the query from one IP address.

Kaiming Wan

unread,

Mar 6, 2020, 11:29:43 PM3/6/20

to Prometheus Users

Hello, how can i get the grfana dashboard like you to monitor prometheus's inner metric itself. Can the dashboard config be shared?

Reply all

Reply to author

Forward

0 new messages