What is happening, and why? Prometheus OOMed for a small-size cluster.

390 views

sudhir...@gmail.com

unread,
Dec 8, 2017, 4:59:22 AM12/8/17
to Prometheus Users
Hi,
Any direction on debugging this problem, and any way to mitigate it, would be much appreciated.
We have been running Prometheus 1.6 with great success for over a year, as an in-cluster pod in Kubernetes (OpenShift).
Over the past months we started seeing Prometheus drop scrapes while we were on 1.6-1.8; the only thing that changed was that we added 5 additional nodes to the cluster.
Some quick googling helped us figure out it was a memory issue for the pod, so we started to tune it. But then 2.0 was recommended, as it handles everything for you.
We migrated what we had to 2.0, but the pod still died even with a large amount of memory (25-30 GB).
Before, everything was scraped by a single Prometheus, so we split the jobs across multiple Prometheus servers, each with only one job, in order to understand which job is causing the issue.

Now we see that the Prometheus with only this config file is being repeatedly OOMed:
    global:
      scrape_interval: 60s
      scrape_timeout: 60s

    rule_files:
      - '/etc/prometheus/alerts/alert.rules'

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - prometheus-alertmanager:9093

    scrape_configs:
    - job_name: 'kubernetes-nodes'
      tls_config:
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
      - role: node

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [container_label_io_kubernetes_pod_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [container_label_io_kubernetes_container_name]
        action: replace
        target_label: container_name


We have just 18 nodes, all bare metal (a mix of blade servers and standalone machines), and around 250-300 containers. We also have autoscaling enabled, so the container count varies at peak load. The amount of metrics we see in Prometheus is 50k-75k samples per second, and the series count is around 4.0 million. The limit we gave the pod is 20 GB. The pod is stable for a while, but then gets OOMed. We have a Kubernetes resource-usage dashboard, which is reloaded every 15 minutes.

My question is: is it reasonable to get a 4 million series count for just 18 nodes and 250-300 containers? What is being reported by each node (we are on Kubernetes 1.5), and can we get the number of samples scraped down? Our goal is to have a view of the memory, CPU, and network usage of each container via Kubernetes. The statistics about each node come from node_exporter, which is scraped by another Prometheus. I can provide other statistics as well if needed, but we need to find out why there are 4 million series, and why samples per second swing between 45k-75k, for such a small cluster.
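A quick way to see which metric names dominate the series count is a counting aggregation over all series (a sketch; note that an unfiltered matcher like this is itself expensive on a heavily loaded instance, so it may be slow):

```promql
topk(10, count by (__name__) ({__name__=~".+"}))
```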

Brian Brazil

unread,
Dec 8, 2017, 5:45:53 AM12/8/17
to sudhir...@gmail.com, Prometheus Users
That's a lot of series for 1.x, and 20GB is unlikely to be enough. https://www.robustperception.io/how-much-ram-does-my-prometheus-need-for-ingestion/ will give you an idea of what's going on here.

Brian
 

 

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/508eae1f-2d3f-403d-9cc8-2b0c51470e97%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




sudhir...@gmail.com

unread,
Dec 8, 2017, 6:18:16 AM12/8/17
to Prometheus Users
Hi Brian,

Thanks for the quick reply. We are now running 2.0, and neither of the queries in the article works there, so we cannot investigate further that way.
Any other recommendations? I was interested in knowing why a cluster of 18 nodes generates 4 million series. Is there anything we can tune down to get just the memory and CPU utilization of containers (pods) reported from the /metrics endpoint of those nodes?

With regards
Sudhir
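(For reference: the article predates 2.0, but Prometheus 2.0 exposes its own TSDB statistics as metrics, so rough equivalents of the article's queries, assuming the default 2.0 metric names, would be:)

```promql
# current number of series in the head block
prometheus_tsdb_head_series

# ingestion rate, in samples per second
rate(prometheus_tsdb_head_samples_appended_total[5m])

# series churn: how fast new series are being created
rate(prometheus_tsdb_head_series_created_total[5m])
```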

Brian Brazil

unread,
Dec 8, 2017, 7:12:07 AM12/8/17
to sudhir pandey, Prometheus Users

With 300 containers and auto-scaling, that's not outside the realms of possibility.

Brian
 
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/7f87b24e-289b-4a5d-841c-45cefad78207%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--

Ben Kochie

unread,
Dec 8, 2017, 7:56:51 AM12/8/17
to sudhir...@gmail.com, Prometheus Users
4M series from that small a cluster sounds slightly out of whack. You might have something generating a lot of labels that it shouldn't be.



sudhir...@gmail.com

unread,
Dec 8, 2017, 10:13:03 AM12/8/17
to Prometheus Users
Thanks Ben, the tool works wonderfully well for the Prometheus servers that have 50k-60k total series.

But for the one with 4 million total series, the web page gets stuck, takes forever to load, and ultimately comes back with a 502 error.
Upon inspection, this query would not load at all:

api/v1/series?match[]=ALERTS&match[]=cadvisor_version_info&match[]=container_cpu_cfs_periods_total&match[]=container_cpu_cfs_throttled_periods_total&match[]=container_cpu_cfs_throttled_seconds_total&match[]=container_cpu_system_seconds_total&match[]=container_cpu_usage_seconds_total&match[]=container_cpu_user_seconds_total&match[]=container_fs_inodes_free&match[]=container_fs_inodes_total

Also, I have seen that when you fire this query against promvt, the backend Prometheus server gets killed. Possibly it tries to get everything out of it at once.

So I am wondering whether we can query Prometheus to see what is happening without getting it OOMed.
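As a general approach, counting aggregations are much cheaper for the server to answer than /api/v1/series calls that return every series, and they can be issued one metric at a time, for example (a sketch, picking one of the cAdvisor metric names from the query above):

```promql
count(container_cpu_usage_seconds_total)

count by (__name__) ({__name__=~"container_.*"})
```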

sudhir...@gmail.com

unread,
Dec 8, 2017, 11:45:14 AM12/8/17
to Prometheus Users
After deleting all the Prometheus data and starting a fresh pod, I was finally able to get the Prometheus visualization after 32 minutes of waiting.


All the outer rings are id labels of the containers, prefixed with their respective metrics. So I am now wondering if there is something wrong with the /metrics endpoint, such that it provides different ids each time Prometheus scrapes.
However, our pod metrics in Grafana are continuous, and we are graphing them on the basis of the pod_name label.

Is it possible to know the source of the id in the Prometheus metric? Where is it grabbed by the kubelet and added as a label to the exposed metrics? Is it the Docker id of the container, or an id maintained by the kubelet itself? We have been running Kubernetes 1.5 for a long time, and we only started to see this problem (an explosion of metrics) after we added a couple of nodes with the same Kubernetes version, but I can see the Docker version is different on those nodes.

Brian Brazil

unread,
Dec 8, 2017, 11:51:02 AM12/8/17
to sudhir pandey, Prometheus Users

This is likely correct: you have 23k container ids across your nodes over the time Prometheus was monitoring. 1k containers per machine inside 30 minutes is a lot; do you have a crashloop or something?

Brian
 

sudhir...@gmail.com

unread,
Dec 8, 2017, 12:11:08 PM12/8/17
to Prometheus Users
This is what is confusing me.
At this time we have about 316 containers, and we don't have any crash-looping ones. So, for a total of 197 metrics, I was expecting around 197*316 series, somewhere in the 100k ballpark. But Prometheus is scraping millions of series.

The maximum number of containers we have on a node is about 30.


At 16:16 I wiped out all the storage, launched a new instance, and ran promvt to gather Prometheus label stats. prometheus-nodes is the one that only collects metrics from the kubelet endpoint.
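A sketch of queries that could localize such a discrepancy, per target and per metric (assuming instance carries the node's address, as it does by default):

```promql
# series per scrape target
count by (instance) ({__name__=~"container_.*"})

# distinct id label values for one representative metric
count(count by (id) (container_start_time_seconds))
```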








Brian Brazil

unread,
Dec 8, 2017, 12:13:28 PM12/8/17
to sudhir pandey, Prometheus Users

I'd suggest poking around something like container_start_time_seconds and seeing where the discrepancy is.

Brian
 

sudhir...@gmail.com

unread,
Dec 8, 2017, 12:36:48 PM12/8/17
to Prometheus Users
Thanks Brian,

I tried to look into it by making a simple query:
count(container_start_time_seconds) by (id)

and it returns a large number of results like these:
{id="/system.slice/system-check_mk.slice/check_mk@30185-Ipaddres:6556-Ipaddress:34214.service"}

This seems to be the culprit. Is there any way we can drop the metrics with this kind of id pattern in the label while scraping?





Ben Kochie

unread,
Dec 8, 2017, 12:44:19 PM12/8/17
to sudhir pandey, Prometheus Users
Yes, you can use metric_relabel_configs to regexp keep/drop things.


sudhir...@gmail.com

unread,
Dec 8, 2017, 1:04:43 PM12/8/17
to Prometheus Users
Thanks Ben, Brian for the support

Adding the following to the Prometheus config file now gives about 1.6k samples per second and 120k total series, as expected:


metric_relabel_configs:
- source_labels: [id]
  regex: '/system\.slice/system-check_mk\.slice/check_mk.*'
  action: drop

with regards
Sudhir

sudhir...@gmail.com

unread,
Dec 11, 2017, 6:39:52 AM12/11/17
to Prometheus Users
Just to add, and to finally reveal the source of the problem for us: it was a systemd service from our monitoring agent (check_mk). Its systemd unit seems to create a slice for each and every connection made to it from the monitoring server.

Once we made our Prometheus server drop those slices (system.slice/system-check_mk.slice/check_mk.*), we started to see target nodes dropping out (timing out) while Prometheus tried to query the metrics endpoint of Kubernetes.


On the node side we saw a huge load, around 300-500, and the process consuming all the CPU was the kubelet on the worker node.
So we figured that even if we drop such series on the Prometheus server side, it was still a problem on the target nodes: each time Prometheus queries a node, it has to go through all those slices that cAdvisor sees as containers and reports to Prometheus, and that load built up as more system.slice/system-check_mk records were added over time.

Prometheus has helped us reveal the problem, and we are grateful for that :)

So in the end we turned off the systemd service provided by the monitoring agent and used an xinetd service instead.

Ben Kochie

unread,
Dec 11, 2017, 7:06:06 AM12/11/17
to sudhir pandey, Prometheus Users
I'd be interested to know more about this systemd config that was generating tons of cgroups.

Perhaps we should get cAdvisor to blacklist some of these things by default.


sudhir...@gmail.com

unread,
Dec 11, 2017, 8:20:10 AM12/11/17
to Prometheus Users

The systemd unit files for this particular service looked like this:

check_mk.socket
# systemd socket definition file
[Unit]
Description=Check_MK Agent Socket

[Socket]
ListenStream=6556
# Accept=true makes systemd spawn a separate check_mk@.service instance
# (and with it a new cgroup) for every incoming connection
Accept=true

[Install]
WantedBy=sockets.target

and 
check_mk@.service
# systemd service definition file
[Unit]
Description=Check_MK

[Service]
ExecStart=/usr/bin/check_mk_agent
KillMode=process

User=root
Group=root

StandardInput=socket

We were starting check_mk.socket, which in turn started check_mk@.service; I think that ended up creating lots of cgroups, one each time it received a connection.

It would have been nice if we could have provided some blacklist at scrape time itself, so cAdvisor would not go through them at all.
Anyway, we have a workaround for that one now, and it seems to work reasonably well.