The query time and CPU/Memory consumption increased but we didn't introduce new serious(other than some pod rotation), till now we can't find the query cause the problem.
1. We are using Prometheus-operator to deploy Prometheus, the latest version is compatible with v2.7.2, so we upgrade from v2.5.0 to v2.7.2.
The problem seems gone and the CPU/Memory back to normal for like a week then the problem back again.
2. Checked some Prometheus internal metrics, attached some key metrics below:
3. Open debug logs, attached some logging sample below.
Not identify key information from it, saw from a github issue that Prometheus not expose slow query logs yet.
level=debug ts=2019-05-17T16:54:42.168017996Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service total 0 items received"
level=debug ts=2019-05-17T16:54:52.269476837Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod total 0 items received"
level=debug ts=2019-05-17T16:54:52.615343281Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service total 0 items received"
level=debug ts=2019-05-17T16:54:53.493438475Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod total 0 items received"
level=debug ts=2019-05-17T16:54:58.626413142Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod total 0 items received"
level=error ts=2019-05-17T16:55:31.949151341Z caller=api.go:1099 component=web msg="error writing response" bytesWritten=0 err="write tcp 10.0.77.181:9090->
10.0.98.133:50764: write: broken pipe"
level=error ts=2019-05-17T16:55:31.977217056Z caller=api.go:1099 component=web msg="error writing response" bytesWritten=0 err="write tcp 10.0.77.181:9090->
10.0.98.133:50758: write: broken pipe"
level=error ts=2019-05-17T16:55:31.991923795Z caller=api.go:1099 component=web msg="error writing response" bytesWritten=0 err="write tcp 10.0.77.181:9090->
10.0.98.133:50786: write: broken pipe"
level=debug ts=2019-05-17T16:55:54.827951493Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - *v1.Endpoints total 0 items received"
level=warn ts=2019-05-17T16:55:54.837493003Z caller=klog.go:86 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:300: watch of *v1.Endpoints ended with: too old resource version: 25278160 (25278726)"
level=debug ts=2019-05-17T16:55:55.837642377Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="Listing and watching *v1.Endpoints from /app/discovery/kubernetes/kubernetes.go:300"
level=debug ts=2019-05-17T16:56:39.672732588Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: forcing resync"
level=debug ts=2019-05-17T16:57:00.686226407Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - *v1.Endpoints total 0 items received"
level=warn ts=2019-05-17T16:57:00.688149973Z caller=klog.go:86 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:300: watch of *v1.Endpoints ended with: too old resource version: 25278468 (25278836)"
level=debug ts=2019-05-17T16:57:01.688301933Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="Listing and watching *v1.Endpoints from /app/discovery/kubernetes/kubernetes.go:300"
level=debug ts=2019-05-17T16:57:35.017591915Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - *v1.Endpoints total 306 items received"
level=debug ts=2019-05-17T16:58:01.203323354Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service total 0 items received"
level=debug ts=2019-05-17T16:59:21.431595235Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod total 0 items received"
level=error ts=2019-05-17T16:59:49.77703169Z caller=api.go:1099 component=web msg="error writing response" bytesWritten=0 err="write tcp 10.0.77.181:9090->
10.0.98.133:51904: write: broken pipe"
level=info ts=2019-05-17T17:00:03.594506257Z caller=compact.go:443 component=tsdb msg="write block" mint=1558101600000 maxt=1558108800000 ulid=01DB3BRWSDRAZE0CA29N1HXFTC
level=info ts=2019-05-17T17:00:04.49353602Z caller=head.go:526 component=tsdb msg="head GC completed" duration=211.926469ms
level=info ts=2019-05-17T17:00:08.512417508Z caller=head.go:573 component=tsdb msg="WAL checkpoint complete" first=5642 last=5650 duration=4.018819936s
level=debug ts=2019-05-17T17:00:23.151252507Z caller=klog.go:53 component=k8s_client_runtime func=Verbose.Infof msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - *v1.Endpoints total 0 items received"
4. Disabled all alerts and make sure no one access the dashboard, the heap size drops.(Dumb debugging way)
Feel little lost about how to further troubleshooting this issue, a dumb way is disable this chart one by one and see how it goes...Any advice or guidence is appriciated, thanks in advance!